Electrical Engineering and Systems Science
- [1] arXiv:2406.05165 [pdf, html, other]
Title: Statistical AoI, Delay, and Error-Rate Bounded QoS Provisioning for Satellite-Terrestrial Integrated Networks
Comments: arXiv admin note: text overlap with arXiv:2406.04685
Subjects: Systems and Control (eess.SY)
Massive ultra-reliable and low latency communications (mURLLC) has emerged to support wireless time/error-sensitive services, attracting significant research attention while imposing several unprecedented challenges. By leveraging the significant improvements in space-aerial-terrestrial resources for comprehensive 3D coverage, satellite-terrestrial integrated networks have been proposed to achieve rigorous and diverse quality-of-service (QoS) constraints of mURLLC. To effectively measure data freshness in satellite communications, age of information (AoI) has recently surfaced as a novel QoS criterion for ensuring time-critical applications. Nevertheless, because of the complicated and dynamic nature of network environments, how to efficiently model multi-dimensional statistical QoS provisioning while upper-bounding peak AoI, delay, and error-rate for diverse network segments is still largely open. To address these issues, we propose statistical QoS provisioning schemes over satellite-terrestrial integrated networks in the finite blocklength regime. In particular, we first establish a satellite-terrestrial integrated wireless network architecture model and an AoI metric model. Second, we derive a series of fundamental statistical QoS metrics including peak-AoI bounded QoS exponent, delay-bounded QoS exponent, and error-rate bounded QoS exponent. Finally, we conduct a set of simulations to validate and evaluate our proposed statistical QoS provisioning schemes over satellite-terrestrial integrated networks.
- [2] arXiv:2406.05199 [pdf, html, other]
Title: XANE: eXplainable Acoustic Neural Embeddings
Authors: Sri Harsha Dumpala, Dushyant Sharma, Chandramouli Shama Sastri, Stanislav Kruchinin, James Fosburgh, Patrick A. Naylor
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
We present a novel method for extracting neural embeddings that model the background acoustics of a speech signal. The extracted embeddings are used to estimate specific parameters related to the background acoustic properties of the signal in a non-intrusive manner, which allows the embeddings to be explainable in terms of those parameters. We illustrate the value of these embeddings by performing clustering experiments on unseen test data and show that the proposed embeddings achieve a mean F1 score of 95.2% for three different tasks, significantly outperforming the WavLM based signal embeddings. We also show that the proposed method can explain the embeddings by estimating 14 acoustic parameters characterizing the background acoustics, including reverberation and noise levels, overlapped speech detection, CODEC type detection and noise type detection with high accuracy and a real-time factor 17 times lower than an external baseline method.
- [3] arXiv:2406.05231 [pdf, html, other]
Title: The ULS23 Challenge: a Baseline Model and Benchmark Dataset for 3D Universal Lesion Segmentation in Computed Tomography
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Size measurements of tumor manifestations on follow-up CT examinations are crucial for evaluating treatment outcomes in cancer patients. Efficient lesion segmentation can speed up these radiological workflows. While numerous benchmarks and challenges address lesion segmentation in specific organs like the liver, kidneys, and lungs, the larger variety of lesion types encountered in clinical practice demands a more universal approach. To address this gap, we introduced the ULS23 benchmark for 3D universal lesion segmentation in chest-abdomen-pelvis CT examinations. The ULS23 training dataset contains 38,693 lesions across this region, including challenging pancreatic, colon and bone lesions. For evaluation purposes, we curated a dataset comprising 775 lesions from 284 patients. Each of these lesions was identified as a target lesion in a clinical context, ensuring diversity and clinical relevance within this dataset. The ULS23 benchmark is publicly accessible via this http URL, enabling researchers worldwide to assess the performance of their segmentation methods. Furthermore, we have developed and publicly released our baseline semi-supervised 3D lesion segmentation model. This model achieved an average Dice coefficient of 0.703 $\pm$ 0.240 on the challenge test set. We invite ongoing submissions to advance the development of future ULS models.
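The Dice coefficient reported for the baseline measures the overlap between a predicted mask and a reference mask, 2|A∩B| / (|A| + |B|). A minimal sketch, assuming binary masks flattened to sequences of 0/1 values; the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are illustrative choices, not taken from the ULS23 evaluation code:

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks (hypothetical helper)."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as lesion in both masks.
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    # Total lesion voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total
```

A score of 1.0 means the masks coincide and 0.0 means they share no voxels.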
- [4] arXiv:2406.05239 [pdf, html, other]
Title: Risk-Aware Finite-Horizon Social Optimal Control of Mean-Field Coupled Linear-Quadratic Subsystems
Comments: 6 pages, 1 figure, to be published in IEEE Control Systems Letters, for associated simulation code, see this https URL
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
We formulate and solve an optimal control problem with cooperative, mean-field coupled linear-quadratic subsystems and additional risk-aware costs depending on the covariance and skew of the disturbance. This problem quantifies the variability of the subsystem state energy rather than merely its expectation. In contrast to related work, we develop an alternative approach that illuminates a family of matrices with many analytical properties, which are useful for effectively extracting the mean-field coupled solution from a standard LQR solution.
- [5] arXiv:2406.05259 [pdf, html, other]
Title: A model of early word acquisition based on realistic-scale audiovisual naming events
Comments: 22 pages, 4 figures, journal article, submitted for review
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Infants gradually learn to parse continuous speech into words and connect names with objects, yet the mechanisms behind development of early word perception skills remain unknown. We studied the extent to which early words can be acquired through statistical learning from regularities in audiovisual sensory input. We simulated word learning in infants up to 12 months of age in a realistic setting, using a model that solely learns from statistical regularities in unannotated raw speech and pixel-level visual input. Crucially, the quantity of object naming events was carefully designed to match that accessible to infants of comparable ages. Results show that the model effectively learns to recognize words and associate them with corresponding visual objects, with a vocabulary growth rate comparable to that observed in infants. The findings support the viability of general statistical learning for early word perception, demonstrating how learning can operate without assuming any prior linguistic capabilities.
- [6] arXiv:2406.05286 [pdf, html, other]
Title: Signal processing algorithm effective for sound quality of hearing loss simulators
Comments: This paper has been accepted for publication in Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Hearing loss (HL) simulators, which allow normal hearing (NH) listeners to experience HL, have been used in speech intelligibility experiments, but not in sound quality experiments due to perceptible distortion. If they produced less distortion, they might be useful for NH listeners to evaluate the sound quality of, for example, hearing aids. We conducted perceptual sound quality experiments to compare the Cambridge version of the HL simulator (CamHLS) and the Wakayama version of the HL simulator (WHIS), which offers two algorithms: filterbank analysis synthesis (FBAS) and direct time-varying filter (DTVF). The experimental results showed that WHIS with DTVF produces less perceptible distortion in speech sounds than CamHLS and WHIS with FBAS, even when the nonlinear process is working. This advantage is mainly due to the use of the DTVF algorithm, which could be applied to various signal synthesis applications with filterbank analysis.
- [7] arXiv:2406.05298 [pdf, html, other]
Title: Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis
Subjects: Audio and Speech Processing (eess.AS)
Historically, most speech models in machine-learning have used the mel-spectrogram as a speech representation. Recently, discrete audio tokens produced by neural audio codecs have become a popular alternate speech representation for speech synthesis tasks such as text-to-speech (TTS). However, the data distribution produced by such codecs is too complex for some TTS models to predict, hence requiring large autoregressive models to get reasonable quality. Typical audio codecs compress and reconstruct the time-domain audio signal. We propose a spectral codec which compresses the mel-spectrogram and reconstructs the time-domain audio signal. A study of objective audio quality metrics suggests that our spectral codec has comparable perceptual quality to equivalent audio codecs. Furthermore, non-autoregressive TTS models trained with the proposed spectral codec generate audio with significantly higher quality than when trained with mel-spectrograms or audio codecs.
- [8] arXiv:2406.05300 [pdf, html, other]
Title: Harnessing Multimodal Sensing for Multi-user Beamforming in mmWave Systems
Subjects: Signal Processing (eess.SP)
Sensor-aided beamforming reduces the overheads associated with beam training in millimeter-wave (mmWave) multi-input-multi-output (MIMO) communication systems. Most prior work, though, neglects the challenges associated with establishing multi-user (MU) communication links in mmWave MIMO systems. In this paper, we propose a new framework for sensor-aided beam training in MU mmWave MIMO systems. We leverage the beamspace representation of the channel that contains only the angles-of-departure (AoDs) of the channel's significant multipath components. We show that a deep neural network (DNN)-based multimodal sensor fusion framework can estimate the beamspace representation of the channel using sensor data. To aid the DNN training, we introduce a novel supervised soft-contrastive loss (SSCL) function that leverages the inherent similarity between channels to extract similar features from the sensor data for similar channels. Finally, we design an MU beamforming strategy that uses the estimated beamspaces of the channels to select analog precoders for all users in a way that prevents transmission to multiple users over the same directions. Compared to the baseline, our approach achieves more than 4$\times$ improvement in the median sum-spectral efficiency (SE) at 42 dBm equivalent isotropic radiated power (EIRP) with 4 active users. This demonstrates that sensor data can provide more channel information than previously explored, with significant implications for machine learning (ML)-based communication and sensing systems.
- [9] arXiv:2406.05301 [pdf, html, other]
Title: Active Islanding Detection Using Pulse Compression Probing
Comments: Pending Publication at 2024 IEEE PESGM
Subjects: Systems and Control (eess.SY)
An islanding detection scheme is developed using pulse compression probing (PCP). A state space system realization is taken from the probing output. The nu-gap metric is applied to compare the measured system to the fully intact system and classify it as islanded or grid-connected. The designed detector displays fast operation and accurate islanding detection results under varying grid conditions, and it is physically implementable at the terminals of an inverter. The method is verified via electro-magnetic transient (EMT) simulation on a modified IEEE 34 bus test system with randomized loads and simultaneous probing at three independent solar plants, with the probing signal directly implemented into the logic of a switching inverter model.
- [10] arXiv:2406.05312 [pdf, html, other]
Title: Deep convolutional demosaicking network for multispectral polarization filter array
Subjects: Image and Video Processing (eess.IV)
To address the demosaicking problem in multispectral polarization filter array (MSPFA) imaging, we propose a multispectral polarization demosaicking network (MSPDNet) that improves image reconstruction accuracy. Imaging with a multispectral polarization filter array acquires multispectral polarization information in a snapshot. The full-resolution multispectral polarization image must be reconstructed from a mosaic image. In the proposed method, a sparse image in which pixel values of the same channel are extracted from a mosaic image is used as input to MSPDNet. Missing pixels are interpolated by learning spatial and wavelength correlations from the observed pixels in the mosaic image. Moreover, by using 3D convolution, features are extracted at each convolution layer, and by deepening the network, even detailed features of the multispectral polarization image can be learned. Experimental results show that MSPDNet can reconstruct multi-wavelength and multi-polarization angle information with high accuracy in terms of peak signal-to-noise ratio (PSNR) evaluation and visual quality, indicating the effectiveness of the proposed method compared to other methods.
- [11] arXiv:2406.05314 [pdf, html, other]
Title: Relational Proxy Loss for Audio-Text based Keyword Spotting
Comments: 5 pages, 2 figures, Accepted by Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
In recent years, there has been an increasing focus on user convenience, leading to increased interest in text-based keyword enrollment systems for keyword spotting (KWS). Since the system utilizes text input during the enrollment phase and audio input during actual usage, we call this task audio-text based KWS. To enable this task, both acoustic and text encoders are typically trained using deep metric learning loss functions, such as triplet- and proxy-based losses. This study aims to improve existing methods by leveraging the structural relations within acoustic embeddings and within text embeddings. Unlike previous studies that only compare acoustic and text embeddings on a point-to-point basis, our approach focuses on the relational structures within the embedding space by introducing the concept of Relational Proxy Loss (RPL). By incorporating RPL, we demonstrated improved performance on the Wall Street Journal (WSJ) corpus.
- [12] arXiv:2406.05325 [pdf, html, other]
Title: LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance
Comments: Accepted by Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Any-to-any singing voice conversion (SVC) is an interesting audio editing technique, aiming to convert the singing voice of one singer into that of another, given only a few seconds of singing data. However, during the conversion process, the issue of timbre leakage is inevitable: the converted singing voice still sounds like the original singer's voice. To tackle this, we propose a latent diffusion model for SVC (LDM-SVC) in this work, which attempts to perform SVC in the latent space using an LDM. We pretrain a variational autoencoder structure using the noted open-source So-VITS-SVC project based on the VITS framework, which is then used for the LDM training. Besides, we propose a singer guidance training method based on classifier-free guidance to further suppress the timbre of the original singer. Experimental results show the superiority of the proposed method over previous works in both subjective and objective evaluations of timbre similarity.
- [13] arXiv:2406.05339 [pdf, html, other]
Title: To what extent can ASV systems naturally defend against spoofing attacks?
Authors: Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Sidhhant Arora, Junichi Yamagishi, Joon Son Chung
Comments: 5 pages, 3 figures, 3 tables, Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target. However, emerging advancements in speech generation technology pose significant threats to the reliability of ASV systems. This study investigates whether ASV effortlessly acquires robustness against spoofing attacks (i.e., zero-shot capability) by systematically exploring diverse ASV systems and spoofing attacks, ranging from traditional to cutting-edge techniques. Through extensive analyses conducted on eight distinct ASV systems and 29 spoofing attack systems, we demonstrate that the evolution of ASV inherently incorporates defense mechanisms against spoofing attacks. Nevertheless, our findings also underscore that the advancement of spoofing attacks far outpaces that of ASV systems, hence necessitating further research on spoofing-robust ASV methodologies.
- [14] arXiv:2406.05341 [pdf, html, other]
Title: Diversifying and Expanding Frequency-Adaptive Convolution Kernels for Sound Event Detection
Comments: Accepted to INTERSPEECH 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Frequency dynamic convolution (FDY conv) has shown state-of-the-art performance in sound event detection (SED) using frequency-adaptive kernels obtained by a frequency-varying combination of basis kernels. However, FDY conv lacks an explicit means to diversify frequency-adaptive kernels, potentially limiting the performance. In addition, the size of the basis kernels is limited, while time-frequency patterns span a larger spectro-temporal range. Therefore, we propose dilated frequency dynamic convolution (DFD conv), which diversifies and expands frequency-adaptive kernels by introducing different dilation sizes to basis kernels. Experiments showed the advantages of varying dilation sizes along the frequency dimension, and analysis of attention weight variance proved that dilated basis kernels are effectively diversified. By adapting a class-wise median filter with the intersection-based F1 score, the proposed DFD-CRNN outperforms FDY-CRNN by 3.12% in terms of polyphonic sound detection score (PSDS).
- [15] arXiv:2406.05342 [pdf, other]
Title: Compensation for reactive power and harmonic current drawn by a non-linear load in a pv-micro hydro grid
Authors: Raj Krishna Nepal, Bibek Khanal, Sanket Khatiwada, Nirajan Bhandari, Bishal Rijal, Raisha Karmacharya, Ajay Thapa
Comments: 5 pages, 21 figures, submitted on IEEE powercon 2024 conference
Subjects: Systems and Control (eess.SY)
This paper presents a simulation approach to enhance the power quality of a PV-micro hydro grid supplying both a linear consumer load and a non-linear industrial load by integrating a Shunt Active Power Filter (SAPF) that uses instantaneous PQ theory and hysteresis current control band logic. The non-linear load draws reactive power and harmonic current from the source, thereby degrading the power quality. The integration of the SAPF at the point of common coupling (PCC) offers reactive power and harmonic current compensation, ensuring that the current supply to the grid remains nearly sinusoidal and proportional to the active power. By injecting equal and opposite harmonic components, the SAPF effectively reduces Total Harmonic Distortion (THD) from 7% to 2.96%, thereby enhancing the overall power quality of the PV-micro hydro grid system.
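Total Harmonic Distortion is conventionally defined as the RMS of the harmonic components divided by the RMS of the fundamental. A minimal sketch of that standard definition follows; the function name `total_harmonic_distortion` and the sample current values are hypothetical and are not taken from the paper's simulation:

```python
import math
from typing import Sequence


def total_harmonic_distortion(fundamental_rms: float,
                              harmonic_rms: Sequence[float]) -> float:
    """THD as a fraction: sqrt(sum of squared harmonic RMS values) / fundamental RMS."""
    if fundamental_rms <= 0:
        raise ValueError("fundamental RMS must be positive")
    # Combine harmonics (2nd order and above) in quadrature.
    harmonic_total = math.sqrt(sum(h * h for h in harmonic_rms))
    return harmonic_total / fundamental_rms


# Made-up example: 100 A fundamental with 3 A and 4 A harmonic components.
thd_percent = 100.0 * total_harmonic_distortion(100.0, [3.0, 4.0])
```

Multiplying the returned fraction by 100 gives the percentage form used in the abstract.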
- [16] arXiv:2406.05350 [pdf, html, other]
Title: Exploiting Monotonicity to Design an Adaptive PI Passivity-Based Controller for a Fuel-Cell System
Comments: 11 pages, 8 Figs
Subjects: Systems and Control (eess.SY)
We present a controller for a power electronic system composed of a fuel cell (FC) connected to a boost converter which feeds a resistive load. The controller aims to regulate the output voltage of the converter regardless of the uncertainty of the load. Leveraging the monotonicity of the fuel cell polarization curve, we prove that the nonlinear system can be controlled by means of a passivity-based proportional-integral approach. We then extend the result to an adaptive version, allowing the controller to deal with parameter uncertainties, such as inductor parasitic resistance, load, and FC polarization curve parameters. This adaptive design is based on an indirect control approach with online parameter identification performed by a "hybrid" estimator which combines two techniques: the gradient-descent and immersion-and-invariance algorithms. The overall system is proved to be stable with the output voltage regulated to its reference. Experimental results validate our proposal under two real-life scenarios: pulsating load and output voltage reference changes.
- [17] arXiv:2406.05359 [pdf, html, other]
Title: Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization
Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing (Under Review)
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. First, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance the performance of low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. Unlike uniform precision quantization, the mixed precision approach allows for the assignment of varying bit widths to different network layers. Once the bit combination is determined, MSFT is employed to progressively quantize and fine-tune the network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of around 8. Moreover, compared to the uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate a bit combination for any desirable model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges.
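The idea of deriving per-layer quantization centroids with k-means can be illustrated with a plain one-dimensional Lloyd's algorithm. This is only a sketch under stated assumptions: the abstract does not give the details of the authors' adaptive method, so the function names `kmeans_quantize` and `compression_ratio`, the evenly spaced initialization, the iteration count, and the 32-bit full-precision baseline are all hypothetical choices made here for illustration:

```python
from typing import List, Sequence, Tuple


def kmeans_quantize(weights: Sequence[float], bits: int,
                    iters: int = 20) -> Tuple[List[float], List[int]]:
    """Cluster one layer's weights into 2**bits centroids (plain 1-D Lloyd's)."""
    if bits < 1 or not weights:
        raise ValueError("need bits >= 1 and a non-empty weight list")
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Assumed initialization: centroids evenly spaced over the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def assign() -> List[int]:
        # Index of the nearest centroid for every weight.
        return [min(range(k), key=lambda j: abs(w - centroids[j]))
                for w in weights]

    for _ in range(iters):
        codes = assign()
        for j in range(k):
            members = [w for w, c in zip(weights, codes) if c == j]
            if members:  # leave empty clusters where they are
                centroids[j] = sum(members) / len(members)
    return centroids, assign()


def compression_ratio(bits: int, full_precision_bits: int = 32) -> float:
    """Storage ratio for the weight codes alone, ignoring the centroid table."""
    return full_precision_bits / bits
```

Under the assumed 32-bit baseline, `compression_ratio(4)` evaluates to 8.0, which is consistent with the ratio of around 8 quoted for 4-bit quantization. Each weight is stored as a `bits`-wide code and recovered as `centroids[code]`.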
- [18] arXiv:2406.05378 [pdf, html, other]
Title: Practical Explicit-time Stabilization of a Proportional Control System
Subjects: Systems and Control (eess.SY)
Proportional control can be realized directly through the amplification of analog signals, and it also has the advantage of easily tuned parameters in digital signal control. However, it is difficult for proportional control to preset an upper bound on the settling time. To address this problem, a novel practical explicit-time control method is proposed. Under a bounded initial condition, this method makes the system error converge to a predefined neighborhood of zero within an explicit time. More specifically, the initial condition set and the conditionally stable set are solved by the practical explicit-time stabilization theorem. Based on that, a proportional feedback control is established to achieve practical conditional fixed-time stability.
- [19] arXiv:2406.05389 [pdf, other]
Title: A Deep Learning-Augmented Stand-off Radar Scheme for Rapidly Detecting Tree Defects
Authors: Jiwei Qian, Yee Hui Lee, Kaixuan Cheng, Qiqi Dai, Mohamed Lokman Mohd Yusof, Daryl Lee, Abdulkadir C. Yucel
Comments: Accepted and to be published in IEEE Transactions on Geoscience and Remote Sensing
Subjects: Signal Processing (eess.SP); Image and Video Processing (eess.IV)
Tree defect detection is crucial for the structural health screening of trees. Existing nondestructive testing (NDT) techniques for tree defect detection require time-consuming and labor-intensive measurement campaigns. This discourages their application for the routine structural health screening of whole populations of managed urban trees. To address this issue, this study proposes a deep-learning augmented stand-off radar scheme for contactless scanning of tree trunks and rapid detection of tree defects. In this scheme, the antenna is moved along a straight trajectory at a distance from the tree trunk to obtain the trunk's B-scan. The obtained raw B-scan is then processed by a signal-processing framework specifically developed for revealing the scattering signatures of defects in the B-scan, which achieves a 30 dB and 22 dB increase in the signal-to-clutter and noise ratio of the measurement data of tree trunk samples and living trees, respectively. Finally, the processed B-scan is input into a multilevel feature fusion neural network particularly designed for extracting the signature of the defect in the processed B-scan in real time. The developed scheme's applications to the detection of defects in real fresh-cut tree trunks show that the stand-off radar scheme can detect tree defects with 96% accuracy. This stand-off radar scheme is the first contactless NDT technique for tree defect detection that operates on a straight trajectory, and it can potentially be integrated into the routine tree inspection workflow that is part of urban tree management.
- [20] arXiv:2406.05401 [pdf, html, other]
Title: Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
Comments: 5 pages, 2 figures. Final version, accepted to Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Human-Computer Interaction (cs.HC); Sound (cs.SD)
Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech. Please see this https URL for audio and resources.
- [21] arXiv:2406.05421 [pdf, html, other]
Title: 3D MRI Synthesis with Slice-Based Latent Diffusion Models: Improving Tumor Segmentation Tasks in Data-Scarce Regimes
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Despite the increasing use of deep learning in medical image segmentation, the limited availability of annotated training data remains a major challenge due to time-consuming data acquisition and privacy regulations. In the context of segmentation tasks, providing both medical images and their corresponding target masks is essential. However, conventional data augmentation approaches mainly focus on image synthesis. In this study, we propose a novel slice-based latent diffusion architecture designed to address the complexities of volumetric data generation in a slice-by-slice fashion. This approach extends the joint distribution modeling of medical images and their associated masks, allowing a simultaneous generation of both under data-scarce regimes. Our approach mitigates the computational complexity and memory cost typically associated with diffusion models. Furthermore, our architecture can be conditioned by tumor characteristics, including size, shape, and relative position, thereby providing a diverse range of tumor variations. Experiments on a segmentation task using the BRATS2022 dataset confirm the effectiveness of the synthesized volumes and masks for data augmentation.
- [22] arXiv:2406.05437 [pdf, html, other]
Title: From Analog to Digital: Multi-Order Digital Joint Coding-Modulation for Semantic Communication
Subjects: Signal Processing (eess.SP)
Recent studies in joint source-channel coding (JSCC) have fostered a fresh paradigm in end-to-end semantic communication. Despite notable performance achievements, present initiatives in building semantic communication systems primarily hinge on the transmission of continuous channel symbols, thus presenting challenges in compatibility with established digital systems. In this paper, we introduce a novel approach to address this challenge by developing a multi-order digital joint coding-modulation (MDJCM) scheme for semantic communications. First, we construct a digital semantic communication system by integrating a multi-order modulation/demodulation module into a nonlinear transform source-channel coding (NTSCC) framework. Recognizing the non-differentiable nature of modulation/demodulation, we propose a novel substitution training strategy, in which we treat modulation/demodulation as a constrained quantization process and introduce scaling operations alongside manually crafted noise to approximate this process. As a result, semantic communication systems trained with this approximation can be deployed in practical modulation/demodulation scenarios with superior performance. Additionally, we demonstrate the equivalence by analyzing the involved probability distribution. Moreover, to further improve the performance, we develop a hierarchical dimension-reduction strategy to provide a gradual information extraction process. Extensive experimental evaluations demonstrate the superiority of our proposed method over existing digital and non-digital JSCC techniques.
- [23] arXiv:2406.05440 [pdf, html, other]
Title: Finite-Sample Identification of Linear Regression Models with Residual-Permuted Sums
Subjects: Systems and Control (eess.SY); Statistics Theory (math.ST); Machine Learning (stat.ML)
This letter studies a distribution-free, finite-sample data perturbation (DP) method, the Residual-Permuted Sums (RPS), which is an alternative to the Sign-Perturbed Sums (SPS) algorithm, to construct confidence regions. While SPS assumes independent (but potentially time-varying) noise terms which are symmetric about zero, RPS drops the symmetry assumption but assumes i.i.d. noise. The main idea is that RPS permutes the residuals instead of perturbing their signs. This letter introduces RPS in a flexible way, which allows various design choices. RPS has exact finite sample coverage probabilities and we provide the first proof that these permutation-based confidence regions are uniformly strongly consistent under general assumptions. This means that the RPS regions almost surely shrink around the true parameters as the sample size increases. The ellipsoidal outer-approximation (EOA) of SPS is also extended to RPS, and the effectiveness of RPS is validated by numerical experiments.
- [24] arXiv:2406.05444 [pdf, html, other]
Title: A Generalized Pointing Error Model for FSO Links with Fixed-Wing UAVs for 6G: Analysis and Trajectory Optimization
Comments: 14 pages, 12 figures, under revision; IEEE Transactions on Wireless Communications
Subjects: Systems and Control (eess.SY)
Free-space optical (FSO) communication is a promising solution to support wireless backhaul links in emerging 6G non-terrestrial networks. At the link level, pointing errors in FSO links can significantly impact capacity, making accurate modeling of these errors essential for both assessing and enhancing communication performance. In this paper, we introduce a novel model for FSO pointing errors in unmanned aerial vehicles (UAVs) that incorporates three-dimensional (3D) jitter, including roll, pitch, and yaw angle jittering. We derive a probability density function for the pointing error angle based on the relative position and posture of the UAV to the ground station. This model is then integrated into a trajectory optimization problem designed to maximize energy efficiency while meeting constraints on speed, acceleration, and elevation angle. Our proposed optimization method significantly improves energy efficiency by adjusting the UAV's flight trajectory to minimize exposure to directions highly affected by jitter. The simulation results emphasize the importance of using UAV-specific 3D jitter models in achieving accurate performance measurements and effective system optimization in FSO communication networks. Utilizing our generalized model, the optimized trajectories achieve up to 11.8 percent higher energy efficiency compared to those derived from conventional Gaussian pointing error models.
- [25] arXiv:2406.05452 [pdf, html, other]
-
Title: Near-Field Channel Estimation for Extremely Large-Scale Terahertz CommunicationsSubjects: Signal Processing (eess.SP); Information Theory (cs.IT)
Future Terahertz communications exhibit significant potential in accommodating ultra-high-rate services. Employing extremely large-scale array antennas is a key approach to realize this potential, as they can harness substantial beamforming gains to overcome the severe path loss and leverage the electromagnetic advantages in the near field. This paper proposes novel estimation methods designed to enhance efficiency in Terahertz widely-spaced multi-subarray (WSMS) systems. Initially, we introduce three sparse channel representation methods: polar-domain representation (PD-R), multi-angular-domain representation (MAD-R), and two-dimensional polar-angular-domain representation (2D-PAD-R). Each method is meticulously developed for near-field WSMS channels, capitalizing on their sparsity characteristics. Building on this, we propose four estimation frameworks using the sparse recovery theory: polar-domain estimation (PD-E), multi-angular-domain estimation (MAD-E), two-stage polar-angular-domain estimation (TS-PAD-E), and two-dimensional polar-angular-domain estimation (2D-PAD-E). Particularly, 2D-PAD-E, integrating a 2D dictionary process, and TS-PAD-E, with its sequential approach to angle and distance estimation, stand out as particularly effective for near-field angle-distance estimation, enabling decoupled calculation of these parameters. Overall, these frameworks provide versatile and efficient solutions for WSMS channel estimation, balancing low complexity with high-performance outcomes. Additionally, they represent a fresh perspective on near-field signal processing.
- [26] arXiv:2406.05499 [pdf, html, other]
-
Title: A Pixel-based Reconfigurable Antenna Design for Fluid Antenna SystemsJichen Zhang, Junhui Rao, Zhaoyang Ming, Zan Li, Chi-Yuk Chiu, Kai-Kit Wong, Kin-Fai Tong, Ross MurchComments: 13 pages, 16 figuresSubjects: Signal Processing (eess.SP)
Fluid Antenna Systems (FASs) have recently been proposed for enhancing the performance of wireless communication. Previous antenna designs to meet the requirements of FAS have been based on mechanically movable or liquid antennas and therefore have limited reconfiguration speeds. In this paper, we propose a design for a pixel-based reconfigurable antenna (PRA) that meets the requirements of FAS and the required switching speed. It can provide 12 FAS ports across 1/2 wavelength and consists of an E-slot patch antenna and an upper reconfigurable pixel layer with 6 RF switches. Simulation and experimental results from a prototype operating at 2.5 GHz demonstrate that the design can meet the requirements of FAS including port correlation with matched impedance.
- [27] arXiv:2406.05529 [pdf, html, other]
-
Title: Automatic modulation classification for MIMO system based on the mutual information feature extractionN. Ussipov, S. Akhtanov, Z. Zhanabaev, D. Turlykozhayeva, B. Karibayev, T. Namazbayev, D. Almen, A. Akhmetali, X. TangComments: IEEE Access (2024)Subjects: Signal Processing (eess.SP)
Automatic Modulation Classification (AMC) is an essential technology that is widely applied in various communication scenarios. In recent years, many Machine Learning and Deep-Learning methods have been introduced into AMC, and many of them apply different approaches to eliminate interference in complex Multiple-Input and Multiple-Output (MIMO) signals and improve classification performance. However, in practical communication systems, the perfect elimination of MIMO signal interference is impossible, and therefore classification performance suffers. In this paper, we propose a new AMC algorithm for MIMO systems based on mutual information (MI) feature extraction, which requires neither a large amount of training data nor the elimination of MIMO signal interference. In this approach, features based on mutual information are extracted using In-Phase and Quadrature (IQ) constellation diagrams of MIMO signals, which have not been explored previously. Our method can be effective since mutual information considers the interdependencies among variables and measures how much information about one variable reduces uncertainty about another, providing a valuable perspective for extracting higher-level and interesting features from the data. The effectiveness of our method is evaluated on several model and real-world datasets, and its applicability is proven.
- [28] arXiv:2406.05542 [pdf, html, other]
-
Title: The Development of the Reproductive Healthcare Equity Algorithm (RHEA)Subjects: Systems and Control (eess.SY)
After the repeal of Roe vs. Wade in June 2022, women face long-distance travel across state lines to access abortion care. For women who also face socioeconomic hardship, travel for abortion care is a significant burden. To ease this burden, abortion access nonprofits are funding and/or supplying transportation to abortion clinics. However, due to the uneven distribution of demand and supply for abortions, these nonprofits do not have efficient logistical operations. As a result, low-income, underserved women may not have access to adequate reproductive healthcare, thus widening healthcare inequity gaps. Nonprofits may also risk not serving the needs of vulnerable women without access to adequate reproductive healthcare, and in doing so, waste resources, money, and volunteer hours. To address these challenges, we create an interactive, web-based planning tool, the Reproductive Healthcare Equity Algorithm (RHEA), to guide nonprofits in strategically allocating resources and serving demand. RHEA leverages an optimization model to determine the maximum flow and minimum transportation cost to route women across a network of counties and abortion clinics, subject to transportation supply, budget, and time constraints for one day of operations for a nonprofit. In doing so, we collaborate with abortion access nonprofits to cater our model design and interface development to their needs and considerations. Ultimately, we seek to optimize resource allocation for nonprofits providing abortion care logistics and improve abortion access for low-income, underserved women.
- [29] arXiv:2406.05549 [pdf, html, other]
-
Title: Fractal OAM Generation and Detection SchemesComments: 15 pages, 20 figuresJournal-ref: IEEE Journal on Selected Areas in Communications, vol. 42, no. 6, pp. 1598-1612, June 2024Subjects: Signal Processing (eess.SP)
Orbital angular momentum (OAM) carried electromagnetic waves have the potential to improve spectrum efficiency in optical and radio-frequency communications due to the orthogonal wavefronts of different OAM modes. However, OAM beams are vortically hollow and divergent, which significantly decreases the capacity of OAM transmissions. In addition, unaligned transceivers in OAM transmissions can result in a high bit error rate (BER). The Talbot effect is a self-imaging phenomenon that can be used to generate optical or radio-frequency OAM beams with periodic repeating structures at multiples of a certain distance along the propagation direction. These periodic structures make it unnecessary for the transceiver antennas to be perfectly aligned and can also alleviate the hollow divergence of OAM beams. In this paper, we propose Talbot-effect-based fractal OAM generation and detection schemes using a uniform circular array (UCA) to significantly improve capacity and BER performance in unaligned OAM transmissions. We first provide a brief overview of fractal OAM. Then, we propose the fractal OAM beam generation and detection schemes. Numerical analysis and simulations verify the effectiveness of our proposed fractal OAM generation scheme and also demonstrate improved capacity and BER performance compared to normal OAM transmissions. We also analyze how the receive UCA radius and the distance between the UCAs impact the capacity and BER performances.
- [30] arXiv:2406.05551 [pdf, html, other]
-
Title: Autoregressive Diffusion Transformer for Text-to-Speech SynthesisSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio tokenization often poses a necessary compromise between code bitrate and reconstruction accuracy. When dealing with low-bitrate audio codes, language models are constrained to process only a subset of the information embedded in the audio, which in turn restricts their generative capabilities. To circumvent these issues, we propose encoding audio as vector sequences in continuous space $\mathbb R^d$ and autoregressively generating these sequences using a decoder-only diffusion transformer (ARDiT). Our findings indicate that ARDiT excels in zero-shot text-to-speech and exhibits performance that compares to or even surpasses that of state-of-the-art models. High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing. Our experiments reveal that employing Integral Kullback-Leibler (IKL) divergence for distillation at each autoregressive step significantly boosts the perceived quality of the samples. Simultaneously, it condenses the iterative sampling process of the diffusion model into a single step. Furthermore, ARDiT can be trained to predict several continuous vectors in one step, significantly reducing latency during sampling. Impressively, one of our models can generate $170$ ms of $24$ kHz speech per evaluation step with minimal degradation in performance. Audio samples are available at this http URL .
- [31] arXiv:2406.05552 [pdf, html, other]
-
Title: Joint Reflection and Power Splitting Optimization for RIS-assisted OAM-SWIPTComments: 6 pages, 5 figuresJournal-ref: GLOBECOM 2022 - 2022 IEEE Global Communications Conference, Rio de Janeiro, Brazil, 2022, pp. 1073-1078Subjects: Signal Processing (eess.SP)
Simultaneous wireless information and power transfer (SWIPT) can enhance the spectrum and power efficiencies of wireless communications networks. Line-of-sight (LOS) transmission is a typical SWIPT scenario. However, the strong channel correlation limits the spectrum and energy efficiencies of SWIPT in the LOS channel. Due to the orthogonal wavefronts, orbital angular momentum (OAM) waves can facilitate the SWIPT in LOS channels. With the assistance of the reconfigurable intelligent surface (RIS), both the energy efficiency and capacity can be further improved for the OAM-SWIPT systems. In this paper, we model the RIS-assisted OAM-SWIPT transmission and derive the optimal reflection coefficients and power splitting ratio for it. We first give the system and channel models. Then, we propose the transmission scheme. Based on the transmission scheme, we formulate the capacity and energy harvesting (EH) trade-off problem. We solve the problem by developing an alternating optimization algorithm. Simulations validate the capacity and EH enhancements brought by the RIS for OAM-SWIPT.
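The capacity versus energy-harvesting trade-off governed by the power splitting ratio can be illustrated with a textbook single-link model. The sketch assumes a linear harvester with efficiency `efficiency` and a ratio `rho` of received power routed to it; these symbols and the model are assumptions for illustration and do not include the paper's RIS or OAM channel.

```python
import math

def power_splitting(received_power, noise_power, rho, efficiency=0.7):
    """Return (rate in bit/s/Hz, harvested power) for a splitting ratio rho in [0, 1]."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    harvested = efficiency * rho * received_power  # share sent to the energy harvester
    rate = math.log2(1.0 + (1.0 - rho) * received_power / noise_power)  # share decoded
    return rate, harvested

for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
    rate, harvested = power_splitting(received_power=1.0, noise_power=0.01, rho=rho)
    print(f"rho={rho:.2f}  rate={rate:.3f}  harvested={harvested:.3f}")
```

Sweeping `rho` traces the rate-energy boundary: the rate falls and the harvested power rises as more of the received signal is diverted to the harvester.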
- [32] arXiv:2406.05555 [pdf, html, other]
-
Title: OAM-SWIPT for IoE-Driven 6GComments: 7 pages, 6 figuresJournal-ref: in IEEE Communications Magazine, vol. 60, no. 3, pp. 19-25, March 2022Subjects: Signal Processing (eess.SP)
Simultaneous wireless information and power transfer (SWIPT), which achieves both wireless energy transfer (WET) and information transfer, is an attractive technique for the future Internet of Everything (IoE) in sixth-generation (6G) mobile communications. With SWIPT, battery-less IoE devices can be powered while communicating with other devices. Line-of-sight (LOS) RF transmission and near-field inductive coupling based transmission are typical SWIPT scenarios; both are LOS channels and lack sufficient degrees of freedom for high spectrum efficiency and high energy efficiency. Due to the orthogonal wavefronts, orbital angular momentum (OAM) can facilitate SWIPT in LOS channels. In this article, we introduce OAM-based SWIPT and discuss its basic advantages and challenges. After introducing OAM-based SWIPT for IoE, we first propose an OAM-based SWIPT system model with OAM-modes assisted dynamic power splitting (DPS). Then, four basic advantages of OAM-based SWIPT are reviewed and demonstrated with numerical analyses. Next, four challenges regarding integrating OAM into SWIPT and possible solutions are discussed. OAM technology provides multiple orthogonal streams to increase both spectrum and energy efficiencies for SWIPT, thus creating many opportunities for future WET and SWIPT research.
- [33] arXiv:2406.05557 [pdf, html, other]
-
Title: Modeling and Performance Analysis of OAM-NFC SystemsComments: 14 pages, 13 figuresJournal-ref: in IEEE Transactions on Communications, vol. 69, no. 12, pp. 7986-8001, Dec. 2021Subjects: Signal Processing (eess.SP)
Due to its low energy consumption and simplicity, near field communication (NFC) has been extensively used in various short-range transmission scenarios, for example, proximity payment and NFC entrance guard. However, the low data rate of NFC limits its application in scenarios that demand high rates, such as high-resolution fingerprint identification and streaming media transmission, as well as promising future high-rate indoor communications among pads, phones, and laptops. In this paper, we model and analyze the performance of the orbital angular momentum based NFC (OAM-NFC) system, which can significantly increase the capacity of NFC. We first give the OAM system model. With coils circularly equipped at the transmitter and receiver, OAM-NFC signals can be transmitted, received, and detected. Then, we develop the OAM-NFC generation and detection schemes for NFC multiplexing transmission. We also analyze the OAM-NFC channel capacity and compare it with those of single-input-single-output (SISO) as well as multi-input-multi-output (MIMO) NFC. Simulation results validate the feasibility and capacity enhancement of our proposed OAM-NFC system. We also analyze how different variables, such as the transceiver misalignment, the numbers of transceiver coils, and the transceiver distance, impact the OAM-NFC capacity.
- [34] arXiv:2406.05580 [pdf, html, other]
-
Title: Adaptive Output Tracking Control with Reference Model System UncertaintiesSubjects: Systems and Control (eess.SY)
This paper develops adaptive output tracking control schemes with the reference output signal generated from an unknown reference system whose output derivatives are also unknown. To deal with such reference system uncertainties, an expanded adaptive controller structure is developed to include a parametrized estimator of the equivalent reference input signal. Without using knowledge of the reference system transfer function and equivalent input, both of which are critical components of a traditional model reference adaptive control (MRAC) scheme, the new MRAC schemes, designed for various cases of plant and reference model uncertainties, ensure completely parametrized error systems and stable parameter adaptation, leading to the desired closed-loop system stability and asymptotic output tracking.
- [35] arXiv:2406.05586 [pdf, html, other]
-
Title: Enhanced Flight Envelope Protection: A Novel Reinforcement Learning ApproachSubjects: Systems and Control (eess.SY)
This paper introduces a flight envelope protection algorithm on a longitudinal axis that leverages reinforcement learning (RL). By considering limits on variables such as angle of attack, load factor, and pitch rate, the algorithm counteracts excessive pilot or control commands with restoring actions. Unlike traditional methods requiring manual tuning, RL facilitates the approximation of complex functions within the trained model, streamlining the design process. This study demonstrates the promising results of RL in enhancing flight envelope protection, offering a novel and easy-to-scale method for safety-ensured flight.
- [36] arXiv:2406.05610 [pdf, html, other]
-
Title: Statistical Delay and Error-Rate Bounded QoS Provisioning for AoI-Driven 6G Satellite-Terrestrial Integrated Networks Using FBCSubjects: Systems and Control (eess.SY)
As one of the pivotal enablers for 6G, satellite-terrestrial integrated networks have emerged as a solution to provide extensive connectivity and comprehensive 3D coverage across the spatial-aerial-terrestrial domains to cater to the specific requirements of 6G massive ultra-reliable and low latency communications (mURLLC) applications, while upholding a diverse set of stringent quality-of-service (QoS) requirements. In the context of mURLLC satellite services, the concept of data freshness assumes paramount significance, as the use of outdated data may lead to unforeseeable or even catastrophic consequences. To effectively gauge the degree of data freshness for satellite-terrestrial integrated communications, the notion of age of information (AoI) has recently emerged as a novel dimension of QoS metrics to support time-sensitive applications. Nonetheless, the research efforts directed towards defining novel diverse statistical QoS provisioning metrics, including AoI, delay, and reliability, while accommodating the dynamic and intricate nature of satellite-terrestrial integrated environments, are still in their infancy. To overcome these problems, in this paper we develop analytical modeling formulations/frameworks for statistical QoS over 6G satellite-terrestrial integrated networks using hybrid automatic repeat request with incremental redundancy (HARQ-IR) in the finite blocklength regime. In particular, first we design the satellite-terrestrial integrated wireless network architecture model and AoI metric model. Second, we characterize the peak-AoI bounded QoS metric using HARQ-IR protocol. Third, we develop a set of new fundamental statistical QoS metrics in the finite blocklength regime. Finally, extensive simulations have been conducted to assess and analyze the efficacy of statistical QoS schemes for satellite-terrestrial integrated networks.
- [37] arXiv:2406.05647 [pdf, html, other]
-
Title: Sustainable Wireless Networks via Reconfigurable Intelligent Surfaces (RISs): Overview of the ETSI ISG RISRuiqi Liu, Shuang Zheng, Qingqing Wu, Yifan Jiang, Nan Zhang, Yuanwei Liu, Marco Di Renzo, and George C. AlexandropoulosComments: 7 pages, 5 figures, submitted to an IEEE MagazineSubjects: Signal Processing (eess.SP); Emerging Technologies (cs.ET)
Reconfigurable Intelligent Surfaces (RISs) are a novel form of ultra-low power devices that are capable of increasing communication data rates as well as cell coverage in a cost- and energy-efficient way. This is attributed to their programmable operation that enables them to dynamically manipulate the wireless propagation environment, a feature that has lately inspired numerous research investigations and applications. To pave the way to the formal standardization of RISs, the European Telecommunications Standards Institute (ETSI) launched the Industry Specification Group (ISG) on the RIS technology in September 2021. This article provides a comprehensive overview of the status of the work conducted by the ETSI ISG RIS, covering typical deployment scenarios of reconfigurable metasurfaces, use cases and operating applications, requirements, emerging hardware architectures and operating modes, as well as the latest insights regarding future directions of RISs and the resulting smart wireless environments.
- [38] arXiv:2406.05652 [pdf, html, other]
-
Title: Distributed Combinatorial Optimization of Downlink User Assignment in mmWave Cell-free Massive MIMO Using Graph Neural NetworksBile Peng, Bihan Guo, Karl-Ludwig Besser, Luca Kunz, Ramprasad Raghunath, Anke Schmeink, Eduard A Jorswieck, Giuseppe Caire, H. Vincent PoorSubjects: Signal Processing (eess.SP)
Millimeter wave (mmWave) cell-free massive MIMO (CF mMIMO) is a promising solution for future wireless communications. However, its optimization is non-trivial due to the challenging channel characteristics. We show that mmWave CF mMIMO optimization is largely an assignment problem between access points (APs) and users due to the high path loss of mmWave channels, the limited output power of the amplifier, and the almost orthogonal channels between users given a large number of AP antennas. The combinatorial nature of the assignment problem, the requirement for scalability, and the distributed implementation of CF mMIMO make this problem difficult. In this work, we propose an unsupervised machine learning (ML) enabled solution. In particular, a graph neural network (GNN) customized for scalability and distributed implementation is introduced. Moreover, the customized GNN architecture is hierarchically permutation-equivariant (HPE), i.e., if the APs or users of an AP are permuted, the output assignment is automatically permuted in the same way. To address the combinatorial problem, we relax it to a continuous problem, and introduce an information entropy-inspired penalty term. The training objective is then formulated using the augmented Lagrangian method (ALM). The test results show that the realized sum-rate outperforms that of the generalized serial dictatorship (GSD) algorithm and is very close to the upper bound in a small network scenario, while the upper bound is impossible to obtain in a large network scenario.
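The relaxation step described above (replacing a hard AP-user assignment with a continuous one and penalizing undecided assignments) can be sketched as follows. The abstract only says the penalty is "information entropy-inspired", so the plain Shannon entropy used here, the softmax relaxation, and all function names are assumptions for illustration.

```python
import math

def softmax(scores):
    """Relax a hard one-of-K choice into a probability vector."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; it vanishes exactly when probs is one-hot."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def penalized_objective(sum_rate, assignment_rows, weight):
    """Objective to maximize: sum rate minus an entropy penalty per soft assignment row."""
    return sum_rate - weight * sum(entropy(row) for row in assignment_rows)

soft = [softmax([2.0, 0.1, 0.1]), softmax([0.0, 0.0, 0.0])]  # hypothetical GNN scores
print([round(entropy(row), 3) for row in soft])
```

Because the penalty is zero only for one-hot rows, increasing `weight` during training pushes the relaxed solution back toward a valid discrete assignment.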
- [39] arXiv:2406.05663 [pdf, html, other]
-
Title: Movable Antenna Assisted OAM Wireless Communications With Misaligned TransceiverSubjects: Signal Processing (eess.SP); Information Theory (cs.IT)
The vortex electromagnetic wave carried by multiple orthogonal orbital angular momentum (OAM) modes in the same frequency band can be applied to the field of wireless communications, which greatly increases the spectrum efficiency. The uniform circular array (UCA) is the classical structure to generate and receive vortex electromagnetic waves with multiple OAM-modes. However, when the transmit and receive UCAs are misaligned, there will be interference among the OAM-modes and the signal cannot be recovered at the receiver. In order to solve this problem, we propose a movable antenna (MA) assisted OAM wireless communications scheme. We estimate the rotation angle between the transmit and receive UCAs and feed it back to the transmitter. Then, the MA at the transmitter adjusts the rotation angle to achieve alignment of the UCAs at the receiver and transmitter. Simulation results show that our scheme can significantly improve the spectrum efficiency.
- [40] arXiv:2406.05667 [pdf, other]
-
Title: Achieving High Capacity Transmission With N-Dimensional Quasi-Fractal UCASubjects: Signal Processing (eess.SP); Information Theory (cs.IT)
The vortex electromagnetic wave carried by multiple orthogonal orbital angular momentum (OAM) modes in the same frequency band can be applied to the field of wireless communications, which greatly increases the spectrum efficiency. The uniform circular array (UCA) is widely used to generate and receive vortex electromagnetic waves with multiple OAM-modes. However, the maximum number of orthogonal OAM-modes based on UCA is usually limited to the number of array-elements of the UCA antenna, leaving how to utilize more OAM-modes to achieve higher channel capacity with a fixed number of array-elements as an intriguing question. In this paper, we propose an N-dimensional quasi-fractal UCA (ND QF-UCA) antenna structure in different fractal geometry layouts to break through the limit that the number of array-elements imposes on the number of OAM-modes. We develop the N-dimensional OAM modulation (NOM) and demodulation (NOD) schemes for OAM multiplexing transmission with the OAM-modes number exceeding the array-elements number, which is beyond the traditional concept of multiple antenna based wireless communications. Then, we investigate different dimensional multiplexing transmission schemes based on the corresponding QF-UCA antenna structure with various array-element layouts and evaluate the optimal layout type and dimension to obtain the highest channel capacity with a fixed number of array-elements. Simulation results show that our proposed schemes can obtain a higher spectrum efficiency, surpassing those of alternative array-element layouts of QF-UCA and the traditional multiple antenna systems.
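The limit mentioned above, that a conventional UCA with N elements supports at most N orthogonal OAM-modes, follows from the discrete Fourier structure of the element phases. The sketch below shows this for an ordinary UCA only; the function names are illustrative, and it does not model the proposed quasi-fractal layout.

```python
import cmath
import math

def oam_weights(mode, num_elements):
    """Unit-norm UCA excitation for an OAM mode: element n gets phase 2*pi*mode*n/N."""
    scale = 1.0 / math.sqrt(num_elements)
    return [scale * cmath.exp(2j * math.pi * mode * n / num_elements)
            for n in range(num_elements)]

def mode_overlap(mode_a, mode_b, num_elements):
    """Magnitude of the inner product between two mode excitations."""
    a = oam_weights(mode_a, num_elements)
    b = oam_weights(mode_b, num_elements)
    return abs(sum(x * y.conjugate() for x, y in zip(a, b)))

N = 8
print(mode_overlap(0, 1, N))      # distinct modes below N: numerically zero overlap
print(mode_overlap(1, 1 + N, N))  # modes that differ by N: identical excitations
```

Modes whose indices differ by a multiple of N produce the same element phases, so a standard N-element UCA cannot separate them.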
- [41] arXiv:2406.05671 [pdf, html, other]
-
Title: Efficient Beamforming Feedback Information-Based Wi-Fi Sensing by Feature SelectionSubjects: Signal Processing (eess.SP)
Wi-Fi sensing leveraging plain-text beamforming feedback information (BFI) in multiple-input-multiple-output (MIMO) systems attracts increasing attention. However, due to the implicit relationship between BFI and the channel state information (CSI), quantifying the sensing capability of BFI poses a challenge in building efficient BFI-based sensing algorithms. In this letter, we first derive a mathematical model of BFI, characterizing its relationship with CSI explicitly, and then develop a closed-form expression of BFI for 2x2 MIMO systems. To enhance the efficiency of BFI-based sensing by selecting only the most informative features, we quantify the sensing capacity of BFI using the Cramer-Rao bound (CRB) and then propose an efficient CRB-based BFI feature selection algorithm. Simulation results verify that BFI and CSI exhibit comparable sensing capabilities and that the proposed algorithm halves the number of features, reducing 20% more parameters than baseline methods, at the cost of only slightly increasing positioning errors.
- [42] arXiv:2406.05672 [pdf, html, other]
-
Title: Text-aware and Context-aware Expressive Audiobook Speech SynthesisComments: Accepted by INTERSPEECH2024Subjects: Audio and Speech Processing (eess.AS)
Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we introduce the context encoder to two typical TTS models, including VITS-based TTS and language model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.
- [43] arXiv:2406.05696 [pdf, html, other]
-
Title: Two Power Allocation and Beamforming Strategies for Active IRS-aided Wireless Network via Machine LearningSubjects: Signal Processing (eess.SP)
This paper models an active intelligent reflecting surface (IRS)-assisted wireless communication network, which has the ability to adjust power between the base station (BS) and the IRS. We aim to maximize the signal-to-noise ratio of the user by jointly designing the power allocation (PA) factor, the active IRS phase shift matrix, and the beamforming vector of the BS, subject to a total power constraint. To tackle this non-convex problem, we alternately optimize these variables. First, the PA factor is designed via a polynomial regression method. Next, the BS beamforming vector and IRS phase shift matrix are obtained by Dinkelbach's transform and successive convex approximation methods. To reduce the high computational complexity of the above algorithm, we maximize the achievable rate (AR) and use a closed-form fractional programming method to transform the original problem into an equivalent form. Then, we address this problem by iteratively optimizing the auxiliary variables and the BS and IRS beamforming. Simulation results show that the proposed algorithms can effectively improve the AR performance compared to fixed PA strategies, passive-IRS-aided schemes, and schemes without an IRS.
- [44] arXiv:2406.05699 [pdf, html, other]
-
Title: An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTSXiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki KandaComments: Accepted to INTERSPEECH2024Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of applying speech enhancement to the audio prompt.
- [45] arXiv:2406.05716 [pdf, html, other]
-
Title: Near or far: On determining the appropriate channel estimation strategy in cross-field communicationSubjects: Signal Processing (eess.SP); Information Theory (cs.IT)
The use of ultra-massive multiple-input multiple-output and high-frequency large bandwidth systems is likely in the next-generation wireless communication systems. In such systems, the user moves between near- and far-field regions, and consequently, the channel estimation will need to be carried out in the cross-field scenario. Channel estimation strategies have been proposed for both near- and far-fields, but in the cross-field problem, the first step is to determine whether the near- or far-field is applicable so that an appropriate channel estimation strategy can be employed. In this work, we propose using a hidden Markov model over an ensemble of region estimates to enhance the accuracy of selecting the actual region. The region indicators are calculated using the pair-wise power differences between received signals across the subarrays within an array-of-subarrays architecture. Numerical results show that the proposed method achieves a high success rate in determining the appropriate channel estimation strategy.
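The benefit of running a hidden Markov model over a sequence of noisy region estimates can be illustrated with a two-state forward filter. The transition probability `stay`, the indicator accuracy `accuracy`, and the function name are hypothetical values chosen for illustration; the paper derives its indicators from pair-wise power differences across subarrays, which is not modeled here.

```python
def filter_regions(indicators, stay=0.9, accuracy=0.8):
    """Forward-filter noisy 'near'/'far' indicators with a two-state HMM."""
    p_near = 0.5  # uniform prior over the two regions
    decisions = []
    for step, obs in enumerate(indicators):
        if step > 0:
            # prediction: the user mostly stays in the same region between estimates
            p_near = p_near * stay + (1.0 - p_near) * (1.0 - stay)
        like_near = accuracy if obs == "near" else 1.0 - accuracy
        like_far = accuracy if obs == "far" else 1.0 - accuracy
        joint_near = p_near * like_near
        joint_far = (1.0 - p_near) * like_far
        p_near = joint_near / (joint_near + joint_far)  # Bayes update
        decisions.append("near" if p_near >= 0.5 else "far")
    return decisions

print(filter_regions(["near", "near", "far", "near", "near"]))
```

With these assumed parameters, a single contradictory indicator is not enough to flip the decision, because the accumulated belief from earlier estimates outweighs it.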
- [46] arXiv:2406.05763 [pdf, html, other]
-
Title: WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie
Comments: Accepted by INTERSPEECH2024
Subjects: Audio and Speech Processing (eess.AS)
With the development of large text-to-speech (TTS) models and scale-up of the training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. To tailor it for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process, the resulting WenetSpeech4TTS corpus contains $12,800$ hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing benchmark baselines for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.
- [47] arXiv:2406.05780 [pdf, html, other]
-
Title: Two-Stage Resource Allocation in Reconfigurable Intelligent Surface Assisted Hybrid Networks via Multi-Player Bandits
Comments: This paper was published in IEEE Transactions on Communications
Subjects: Signal Processing (eess.SP)
This paper considers a resource allocation problem where several Internet-of-Things (IoT) devices send data to a base station (BS) with or without the help of the reconfigurable intelligent surface (RIS) assisted cellular network. The objective is to maximize the sum rate of all IoT devices by finding the optimal RIS and spreading factor (SF) for each device. Since these IoT devices lack prior information on the RISs or the channel state information (CSI), a distributed resource allocation framework with low complexity and learning features is required to achieve this goal. Therefore, we model this problem as a two-stage multi-player multi-armed bandit (MPMAB) framework to learn the optimal RIS and SF sequentially. Then, we put forth an exploration and exploitation boosting (E2Boost) algorithm to solve this two-stage MPMAB problem by combining the $\epsilon$-greedy algorithm, Thompson sampling (TS) algorithm, and non-cooperation game method. We derive an upper regret bound for the proposed algorithm, i.e., $\mathcal{O}(\log^{1+\delta}_2 T)$, increasing logarithmically with the time horizon $T$. Numerical results show that the E2Boost algorithm has the best performance among the existing methods and exhibits a fast convergence rate. More importantly, the proposed algorithm is not sensitive to the number of combinations of the RISs and SFs thanks to the two-stage allocation mechanism, which can benefit high-density networks.
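For readers unfamiliar with the $\epsilon$-greedy building block named above, the following is a minimal single-player, single-stage sketch. It is not the E2Boost algorithm: it omits Thompson sampling, the game-theoretic component, and the two-stage RIS/SF structure. The function name `epsilon_greedy`, the Bernoulli reward model, and the arm means are hypothetical and chosen only for illustration.

```python
import random


def epsilon_greedy(true_means, steps, epsilon=0.1, seed=0):
    """Plain epsilon-greedy on Bernoulli arms.

    true_means: assumed success probability of each arm (unknown to the learner).
    Returns (pull counts per arm, estimated mean reward per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample-mean estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


counts, estimates = epsilon_greedy([0.2, 0.5, 0.9], steps=2000)
print(counts, [round(e, 2) for e in estimates])
```

With a fixed exploration rate the learner keeps spending a constant fraction of pulls on suboptimal arms, which is one reason practical schemes combine or decay exploration rather than use this rule alone.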
- [48] arXiv:2406.05790 [pdf, other]
-
Title: Integrated Sensing and Communication for Anti-Jamming with OAM
Subjects: Signal Processing (eess.SP)
The shared spectrum and open nature of wireless channels make integrated sensing and communication (ISAC) susceptible to hostile jamming attacks. Due to the intrinsic orthogonality and rich azimuth angle information of orbital angular momentum (OAM), vortex electromagnetic waves with helical phase fronts have shown great potential to achieve high-resolution imaging and strong anti-jamming capability in wireless communication. Focusing on significantly enhancing the anti-jamming results of ISAC systems with limited bandwidth under hostile jamming, in this paper we propose a novel ISAC for anti-jamming with OAM scheme, where the OAM legitimate transmitter can simultaneously sense the position of jammers with dynamic behavior and send data to multiple OAM legitimate users. Specifically, the OAM modes for sensing and communications are respectively hopped according to pre-set index modulation information to suppress jamming. To acquire the position of the jammer, we develop an enhanced multiple-signal-classification-based three-dimensional position estimation scheme with continuous sensing in both frequency and angular domains, where the OAM transmitter is designed with the concentric uniform-circular-array mono-static method to significantly increase the azimuthal resolution. Then, based on the acquired jamming channel state information, we develop a joint transmit-receive beamforming and power allocation scheme, where the transmit and receive beamforming matrices are dynamically adjusted to mitigate the mixed interference containing inter-mode interference, inter-user interference, and jamming, thus maximizing the achievable sum rates (ASRs) of all users. Numerical results demonstrate that our proposed scheme can significantly increase the ASR under broadband jamming attacks and achieve high detection accuracy of targets.
- [49] arXiv:2406.05799 [pdf, html, other]
-
Title: Double-RIS-Assisted Orbital Angular Momentum Near-Field Secure Communications
Subjects: Signal Processing (eess.SP)
To satisfy the various demands of growing devices and services, emerging high-frequency-based technologies promote near-field wireless communications. Therefore, near-field physical layer security has attracted much attention as a way to protect wireless information against illegitimate eavesdropping. However, in existing multiple-input multiple-output (MIMO) based near-field secure technologies, the highly correlated channels between legitimate transceivers and eavesdroppers, along with the low degrees of freedom, significantly limit the enhancement of security results in wireless communications. To significantly increase the secrecy rates of near-field wireless communications, in this paper we propose the double-reconfigurable-intelligent-surface (RIS) assisted orbital angular momentum (OAM) secure scheme, where RISs with few reflecting elements are easily deployed to reconstruct the direct links blocked by obstacles between the legitimate transceivers, mitigate the inter-mode interference caused by the misalignment of legitimate transceivers, and adjust the OAM beam directions to interfere with eavesdroppers. Meanwhile, due to the unique orthogonality among OAM modes, the OAM-based joint index modulation and artificial noise scheme is proposed to weaken the information acquisition by eavesdroppers while increasing the achievable rate of legitimate communications at low cost. To maximize the secrecy rate of our proposed scheme, we develop the Riemannian manifold conjugate gradient (RMCG)-based alternative optimization (AO) algorithm to jointly optimize the transmit power allocation of OAM modes and phase shifts of double RISs. Numerical results show that our proposed double-RIS-assisted OAM near-field secure scheme outperforms the existing works in terms of the secrecy rate and the eavesdropper's bit error rate.
- [50] arXiv:2406.05839 [pdf, html, other]
-
Title: MaLa-ASR: Multimedia-Assisted LLM-Based ASR
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
As more and more information-rich data like video become available, utilizing multi-modal auxiliary information to enhance audio tasks has sparked widespread research interest. The recent surge in research on LLM-based audio models provides fresh perspectives for tackling audio tasks. Given that LLMs can flexibly ingest multiple inputs, we propose MaLa-ASR, an LLM-based ASR model that can integrate textual keywords extracted from presentation slides to improve recognition of conference content. MaLa-ASR yields average WERs of 9.4% and 11.7% on the L95 and S95 subsets of the SlideSpeech corpus, representing significant relative WER drops of 27.9% and 44.7% over the baseline model reported in SlideSpeech. MaLa-ASR underscores LLMs' strong performance in speech tasks and their capability to integrate auxiliary information conveniently. By adding keywords to the input prompt, the biased word error rate (B-WER) is reduced by a relative 46.0% and 44.2%, establishing a new SOTA on this dataset.
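The relative WER figures quoted above follow the usual definition of relative reduction, (baseline - new) / baseline. The sketch below shows that arithmetic and back-computes the baseline WERs implied by the quoted numbers. The helper names are hypothetical, and the implied baselines are only approximate because the abstract's percentages are rounded.

```python
def relative_reduction(baseline, new):
    """Relative reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline - new) / baseline


def implied_baseline(new, relative_drop_percent):
    """Baseline value implied by a new value and a relative drop in percent."""
    return new / (1.0 - relative_drop_percent / 100.0)


# Baseline WERs implied by the abstract's rounded figures (approximate).
print(round(implied_baseline(9.4, 27.9), 1))   # L95 subset
print(round(implied_baseline(11.7, 44.7), 1))  # S95 subset
```

Under this definition the quoted results would correspond to baseline WERs of roughly 13% and 21%, subject to rounding in the reported percentages.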
- [51] arXiv:2406.05844 [pdf, html, other]
-
Title: Spatial Correlation Modeling and RS-LS Estimation of Near-Field Channels with Uniform Planar Arrays
Subjects: Signal Processing (eess.SP)
Extremely large aperture arrays (ELAAs) can offer massive spatial multiplexing gains in the radiative near-field region in beyond 5G systems. While near-field channel modeling for uniform linear arrays has been extensively explored in the literature, uniform planar arrays, despite their advantageous form factor, have been somewhat neglected due to their more complex nature. Spatial correlation is crucial for non-line-of-sight channel modeling. Unlike far-field scenarios, the spatial correlation properties of near-field channels have not been thoroughly investigated. In this paper, we start from the fundamentals and develop a near-field spatial correlation model for arbitrary spatial scattering functions. Furthermore, we derive the lower-dimensional subspace in which the channel vectors can exist. It is based on prior knowledge of the three-dimensional coverage region where scattering clusters exist, and we derive a tractable one-dimensional integral expression. This subspace is subsequently employed in a reduced-subspace least squares (RS-LS) estimation method for near-field channels, thereby enhancing performance over the traditional least squares estimator without requiring full knowledge of the spatial correlation matrix.
- [52] arXiv:2406.05891 [pdf, other]
-
Title: GCtx-UNet: Efficient Network for Medical Image Segmentation
Comments: 13 pages, 7 figures, 7 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Medical image segmentation is crucial for disease diagnosis and monitoring. Though effective, current segmentation networks such as UNet struggle to capture long-range features. More accurate models such as TransUNet, Swin-UNet, and CS-UNet have higher computational complexity. To address this problem, we propose GCtx-UNet, a lightweight segmentation architecture that can capture global and local image features with accuracy better than or comparable to state-of-the-art approaches. GCtx-UNet uses a vision transformer that leverages global context self-attention modules joined with local self-attention to model long- and short-range spatial dependencies. GCtx-UNet is evaluated on the Synapse multi-organ abdominal CT dataset, the ACDC cardiac MRI dataset, and several polyp segmentation datasets. In terms of Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) metrics, GCtx-UNet outperformed CNN-based and Transformer-based approaches, with notable gains in the segmentation of complex and small anatomical structures. Moreover, GCtx-UNet is much more efficient than state-of-the-art approaches, with smaller model size, lower computation workload, and faster training and inference speed, making it a practical choice for clinical applications.
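The Dice Similarity Coefficient used in the evaluation above is defined for two binary masks A and B as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch on flattened 0/1 masks. The function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two flattened binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_coefficient(pred, target))  # 0.5
```

Because the denominator is the sum of the two mask sizes, a missed pixel costs proportionally more on small structures than on large ones, which is why per-organ DSC is commonly reported alongside the mean.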
- [53] arXiv:2406.05914 [pdf, html, other]
-
Title: Soundscape Captioning using Sound Affective Quality Network and Large Language Model
Comments: Code: this https URL
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
We live in a rich and varied acoustic world, which is experienced by individuals or communities as a soundscape. Computational auditory scene analysis, disentangling acoustic scenes by detecting and classifying events, focuses on objective attributes of sounds, such as their category and temporal characteristics, ignoring the effect of sounds on people and failing to explore the relationship between sounds and the emotions they evoke within a context. To fill this gap and to automate soundscape analysis, which traditionally relies on labour-intensive subjective ratings and surveys, we propose the soundscape captioning (SoundSCap) task. SoundSCap generates context-aware soundscape descriptions by capturing the acoustic scene, event information, and the corresponding human affective qualities. To this end, we propose an automatic soundscape captioner (SoundSCaper) composed of an acoustic model, SoundAQnet, and a general large language model (LLM). SoundAQnet simultaneously models multi-scale information about acoustic scenes, events, and perceived affective qualities, while LLM generates soundscape captions by parsing the information captured by SoundAQnet to a common language. The soundscape caption's quality is assessed by a jury of 16 audio/soundscape experts. The average score (out of 5) of SoundSCaper-generated captions is lower than the score of captions generated by two soundscape experts by 0.21 and 0.25, respectively, on the evaluation set and the model-unknown mixed external dataset with varying lengths and acoustic properties, but the differences are not statistically significant. Overall, SoundSCaper-generated captions show promising performance compared to captions annotated by soundscape experts. The models' code, LLM scripts, human assessment data and instructions, and expert evaluation statistics are all publicly available.
- [54] arXiv:2406.05924 [pdf, html, other]
-
Title: Imageless Contraband Detection Using a Millimeter-Wave Dynamic Antenna Array via Spatial Fourier Domain Sampling
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects: Signal Processing (eess.SP)
We demonstrate an imageless method of concealed contraband detection using a real-time 75 GHz rotationally dynamic antenna array. The array measures information in the two-dimensional Fourier domain and captures a set of samples that is sufficient for detecting concealed objects yet insufficient for generating a full image, thereby preserving the privacy of screened subjects. The small set of Fourier samples contains sharp spatial frequency features in the Fourier domain which correspond to sharp edges of man-made objects such as handguns. We evaluate a set of classification methods: threshold-based, K-nearest neighbor, and support vector machine with a radial basis function kernel, all operating on arithmetic features directly extracted from the sampled Fourier-domain responses measured by a dynamically rotating millimeter-wave active interferometer. Noise transmitters are used to produce thermal-like radiation from scenes, enabling direct Fourier-domain sampling, while the rotational dynamics circularly sample the two-dimensional Fourier domain, capturing the sharp-edge induced responses. We experimentally demonstrate the detection of a concealed metallic gun-shaped object beneath clothing on a real person in a laboratory environment and achieve an accuracy and an F1-score of 0.986 each. The presented technique not only prevents image formation due to efficient Fourier-domain space sub-sampling but also requires only 211 ms from measurement to decision.
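The accuracy and F1-score reported above are standard binary-classification metrics computed from confusion-matrix counts. The sketch below shows the definitions; the function name `accuracy_and_f1` and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1-score from confusion-matrix counts.

    tp/fp/fn/tn: true positives, false positives, false negatives, true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts for a detector evaluated on 20 trials.
acc, f1 = accuracy_and_f1(tp=8, fp=2, fn=2, tn=8)
print(round(acc, 3), round(f1, 3))
```

Accuracy and F1 coincide when false positives and false negatives are balanced and the classes are of equal size, as in this toy example; in general the two can differ substantially on imbalanced data.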
- [55] arXiv:2406.05945 [pdf, html, other]
-
Title: Machine Unlearning for Uplink Interference Cancellation
Subjects: Signal Processing (eess.SP)
Machine unlearning (MUL) is introduced as a means to achieve interference cancellation within artificial intelligence (AI)-enabled wireless systems. It is observed that interference cancellation with MUL yields a $30\%$ improvement in classification accuracy in the presence of a corrupted AI model. Accordingly, the need for instantaneous channel state information for an existing interference source is eliminated, and a latent space corrupted by interference noise is cleansed with the MUL algorithm, without either retraining or dataset cleansing. A Membership Interference Attack (MIA) served as a benchmark for assessing the efficacy of MUL in mitigating interference within a neural network model. The advantage of the MUL algorithm was determined by evaluating both the probability of interference and the quantity of samples requiring retraining. In a simple signal-to-noise ratio classification task, the comprehensive improvement across various test cases in terms of accuracy demonstrates that MUL exhibits extensive capabilities and limitations, particularly in native AI applications.
- [56] arXiv:2406.05947 [pdf, html, other]
-
Title: Accent Conversion with Articulatory Representations
Yashish M. Siriwardena, Nathan Swedlow, Audrey Howard, Evan Gitterman, Dan Darcy, Carol Espy-Wilson, Andrea Fanelli
Comments: Accepted at INTERSPEECH 2024
Subjects: Audio and Speech Processing (eess.AS)
Conversion of non-native accented speech to native (American) English has a wide range of applications, such as improving the intelligibility of non-native speech. Previous work in this domain has used phonetic posteriorgrams as the target speech representation to train an acoustic model, which is then used to extract a compact representation of input speech for accent conversion. In this work, we introduce the idea of using an effective articulatory speech representation, extracted from an acoustic-to-articulatory speech inversion system, to improve the acoustic model used in accent conversion. The idea of incorporating articulatory representations originates from their ability to characterize accents in speech well. To combine articulatory representations with conventional phonetic posteriorgrams, a multi-task learning based acoustic model is proposed. Objective and subjective evaluations show that the use of articulatory representations can improve the effectiveness of accent conversion.
- [57] arXiv:2406.05950 [pdf, other]
-
Title: Economic and Environmental Sustainability Through Reshoring: A Case Study
Comments: 12 pages, 5 figures, 8 tables
Subjects: Systems and Control (eess.SY)
Not too long ago, offshoring was considered a panacea for many U.S. companies to achieve economic sustainability. Offshoring also created an unnecessary movement of goods between the point of consumption and the point of sourcing and hence contributed to greenhouse gas emissions. With circumstances changed, hundreds of U.S. companies have started reshoring. Due to supply chain disruptions and increased tax implications, including tariffs, there is a growing desire among companies to achieve economic and environmental sustainability through reshoring. This model case study highlights the common offshoring challenges and demonstrates new methods and solutions for companies to protect their bottom line. Using the Reshorability Index (RI) and Total Cost of Ownership (TCO), we developed a model to show which products or components should be brought back to the U.S. instead of continuing to be offshored. From this study, we found that reshoring is not only an economically profitable decision but also has a positive impact on reducing greenhouse gas (GHG) emissions. Our research found that companies that currently offshore heavy products will benefit more from implementing our model. Leveraging this model, industries can identify and compare the ownership cost of their purchased materials and decide on potential reshoring. Additionally, companies will be able to calculate GHG emissions and identify the reduction in such emissions due to reshoring.
- [58] arXiv:2406.05961 [pdf, html, other]
-
Title: BS-PLCNet 2: Two-stage Band-split Packet Loss Concealment Network with Intra-model Knowledge Distillation
Comments: Accepted by Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS)
Audio packet loss is an inevitable problem in real-time speech communication. A band-split packet loss concealment network (BS-PLCNet) targeting full-band signals was recently proposed. Although it achieved superior performance in the ICASSP 2024 PLC Challenge, BS-PLCNet is a large model with a high computational complexity of 8.95G FLOPS. This paper presents its updated version, BS-PLCNet 2, to reduce computational complexity and further improve performance. Specifically, to compensate for the missing future information, in the wide-band module we design a dual-path encoder structure (with a non-causal and a causal path) and leverage an intra-model knowledge distillation strategy to distill the future information from the non-causal teacher to the causal student. Moreover, we introduce a lightweight post-processing module after packet loss restoration to repair speech distortions and remove residual noise in the audio signal. With only 40% of the parameters of the original BS-PLCNet, BS-PLCNet 2 brings a 0.18 PLCMOS improvement on the ICASSP 2024 PLC Challenge blind set, achieving state-of-the-art performance on this dataset.
- [59] arXiv:2406.05965 [pdf, html, other]
-
Title: MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance
Comments: Accepted to Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
In this paper, we propose MakeSinger, a semi-supervised training method for singing voice synthesis (SVS) via classifier-free diffusion guidance. The challenge in SVS lies in the costly process of gathering aligned sets of text, pitch, and audio data. MakeSinger enables the training of the diffusion-based SVS model from any speech and singing voice data regardless of its labeling, thereby enhancing the quality of generated voices with a large amount of unlabeled data. At inference, our novel dual guiding mechanism gives text and pitch guidance on the reverse diffusion step by estimating the score of masked input. Experimental results show that the model trained in a semi-supervised manner outperforms other baselines trained only on the labeled data in terms of pronunciation, pitch accuracy, and overall quality. Furthermore, we demonstrate that by adding Text-to-Speech (TTS) data in training, the model can synthesize the singing voices of TTS speakers even without their singing voices.
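As background for the guidance mechanism mentioned above, standard classifier-free guidance combines an unconditional and a conditional noise prediction as eps = eps_uncond + w * (eps_cond - eps_uncond), where w is the guidance scale. The sketch below shows only this generic single-condition rule on plain Python lists. It does not reproduce the paper's dual text and pitch guidance, and the function name `cfg_combine` and the toy vectors are hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Generic classifier-free guidance combination of two noise predictions.

    eps_uncond: prediction with the condition masked out.
    eps_cond: prediction with the condition present.
    scale: guidance scale w (0 gives unconditional, 1 gives conditional).
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
print(cfg_combine(eps_uncond, eps_cond, 2.0))  # [2.0, 5.0]
```

A scale above 1 extrapolates past the conditional prediction, which typically strengthens adherence to the condition at some cost in diversity.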
- [60] arXiv:2406.05966 [pdf, html, other]
-
Title: Approximating arrival costs in distributed moving horizon estimation: A recursive method
Subjects: Systems and Control (eess.SY)
In this paper, we present a new approach to distributed moving horizon estimation for constrained nonlinear processes. The method involves approximating the arrival costs of local estimators through a recursive framework. First, distributed full-information estimation for linear unconstrained systems is presented, which serves as the foundation for deriving the analytical expression of the arrival costs for the local estimators. Subsequently, we develop a recursive arrival cost design for linear distributed moving horizon estimation. Sufficient conditions are derived to ensure the stability of the estimation error for constrained linear systems. Next, we extend the arrival cost design derived for linear systems to account for nonlinear systems, and a partition-based constrained distributed moving horizon estimation algorithm for nonlinear systems is formulated. A benchmark chemical process is used to illustrate the effectiveness and superiority of the proposed method.
- [61] arXiv:2406.05968 [pdf, html, other]
-
Title: Prompting Large Language Models with Audio for General-Purpose Speech Summarization
Comments: Accepted to Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret. Using a dataset with paired speech-text data, the overall system is trained to generate consistent responses to prompts with the same semantic information regardless of the input modality. The resulting framework allows the LLM to process speech inputs in the same way as text, enabling speech summarization by simply prompting the LLM. Unlike prior approaches, our method is able to summarize spoken content from any arbitrary domain, and it can produce summaries in different styles by varying the LLM prompting strategy. Experiments demonstrate that our approach outperforms a cascade baseline of speech recognition followed by LLM text processing.
- [62] arXiv:2406.05974 [pdf, html, other]
-
Title: Inter-slice Super-resolution of Magnetic Resonance Images by Pre-training and Self-supervised Fine-tuning
Comments: ISBI 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In clinical practice, 2D magnetic resonance (MR) sequences are widely adopted. While individual 2D slices can be stacked to form a 3D volume, the relatively large slice spacing can pose challenges for both image visualization and subsequent analysis tasks, which often require isotropic voxel spacing. To reduce slice spacing, deep-learning-based super-resolution techniques are widely investigated. However, most current solutions require a substantial number of paired high-resolution and low-resolution images for supervised training, which are typically unavailable in real-world scenarios. In this work, we propose a self-supervised super-resolution framework for inter-slice super-resolution of MR images. Our framework first pre-trains on a video dataset, as the temporal correlation of videos is found beneficial for modeling the spatial relation among MR slices. Then, we use a public high-quality MR dataset to fine-tune our pre-trained model, enhancing its awareness of medical data. Finally, given a target dataset at hand, we utilize self-supervised fine-tuning to further ensure that our model works well with user-specific super-resolution tasks. The proposed method demonstrates superior performance compared to other self-supervised methods and also holds the potential to benefit various downstream applications.
- [63] arXiv:2406.05976 [pdf, html, other]
-
Title: Dynamic Virtual Power Plants With Frequency Regulation Capacity
Comments: Accepted by IAS Annual Meeting 2024
Subjects: Systems and Control (eess.SY)
To integrate heterogeneous distributed energy resources for fast frequency regulation, this paper proposes a dynamic virtual power plant (DVPP) with frequency regulation capacity. A parameter anonymity-based approach is established for a DVPP aggregating small-scale inverter-based resources (IBRs) with privacy concerns. On this basis, a parameter-to-performance mapping is formulated to evaluate how control coefficients impact the DVPP-level power overshoot as well as the IBR-level costs. The objective is to design the best way to provide the frequency response with minimal impact on the grid and the greatest financial gains. Numerical experiments illustrate the effectiveness of the proposed approach, and further analysis validates that our models are able to take dead bands into consideration.
- [64] arXiv:2406.05982 [pdf, other]
-
Title: Artificial Intelligence for Neuro MRI Acquisition: A Review
Hongjia Yang, Guanhua Wang, Ziyu Li, Haoxiang Li, Jialan Zheng, Yuxin Hu, Xiaozhi Cao, Congyu Liao, Huihui Ye, Qiyuan Tian
Comments: Submitted to MAGMA for review
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Magnetic resonance imaging (MRI) has significantly benefited from the resurgence of artificial intelligence (AI). By leveraging AI's capabilities in large-scale optimization and pattern recognition, innovative methods are transforming the MRI acquisition workflow, including planning, sequence design, and correction of acquisition artifacts. These emerging algorithms demonstrate substantial potential in enhancing the efficiency and throughput of acquisition steps. This review discusses several pivotal AI-based methods in neuro MRI acquisition, focusing on their technological advances, impact on clinical practice, and potential risks.
- [65] arXiv:2406.05983 [pdf, html, other]
-
Title: Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation
Comments: Project Page this https URL
Subjects: Audio and Speech Processing (eess.AS)
Since the success of time-domain speech separation, further improvements have been made by expanding the length and channel of a feature sequence to increase the amount of computation. When temporally expanded to a long sequence, the feature is segmented into chunks, as in a dual-path model, in most studies of speech separation. In particular, it is common for the process of separating features corresponding to each speaker to be located in the final stage of the network. However, it is more advantageous and intuitive to proactively expand the feature sequence to include the number of speakers as an extra dimension. In this paper, we present an asymmetric strategy in which the encoder and decoder are partitioned to perform distinct processing in separation tasks. The encoder analyzes features, and the output of the encoder is split into the number of speakers to be separated. The separated sequences are then reconstructed by the weight-shared decoder, acting as a Siamese network, in addition to cross-speaker processing. By using the Siamese network in the decoder, without using speaker information, the network directly learns to discriminate the features using a separation objective. With a common split layer, intermediate encoder features for skip connections are also split for the reconstruction decoder based on the U-Net structure. In addition, instead of segmenting the feature into chunks as in dual-path models, we design global and local Transformer blocks to directly process long sequences. The experimental results demonstrate that this separation-and-reconstruction framework is effective and that the combination of the proposed global and local Transformer blocks can sufficiently replace the role of inter- and intra-chunk processing in the dual-path structure. Finally, the presented model, including both of these, achieved state-of-the-art performance with less computation than before on various benchmark datasets.
- [66] arXiv:2406.05992 [pdf, html, other]
-
Title: MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba
Comments: 11 pages, 5 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Recently, State Space Models (SSMs), with Mamba as a prime example, have shown great promise for long-range dependency modeling with linear complexity. Vision Mamba and subsequent architectures have since been proposed, and they perform well on visual tasks. The crucial step in applying Mamba to visual tasks is to construct 2D visual features in a sequential manner. To effectively organize and construct visual features within the 2D image space through 1D selective scan, we propose a novel Multi-Head Scan (MHS) module. The embeddings extracted from the preceding layer are projected into multiple lower-dimensional subspaces. Subsequently, within each subspace, the selective scan is performed along distinct scan routes. The resulting sub-embeddings, obtained from the multi-head scan process, are then integrated and ultimately projected back into the high-dimensional space. Moreover, we incorporate a Scan Route Attention (SRA) mechanism to enhance the module's capability to discern complex structures. To validate the efficacy of our module, we exclusively substitute the 2D-Selective-Scan (SS2D) block in VM-UNet with our proposed module, and we train our models from scratch without using any pre-trained weights. The results indicate a significant improvement in performance while reducing the parameters of the original VM-UNet. The code for this study is publicly available at this https URL.
- [67] arXiv:2406.06017 [pdf, html, other]
-
Title: Neuro-TransUNet: Segmentation of stroke lesion in MRI using transformers
Comments: 10 pages, 6 figures
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Accurate segmentation of stroke lesions using magnetic resonance imaging (MRI) is difficult because of the complicated anatomy of the brain and the different properties of the lesions. This study introduces the Neuro-TransUNet framework, which synergizes the U-Net's spatial feature extraction with SwinUNETR's global contextual processing ability, further enhanced by advanced feature fusion and segmentation synthesis techniques. The comprehensive data pre-processing pipeline, which involves resampling, bias correction, and data standardization, improves the framework's efficiency by enhancing data quality and consistency. Ablation studies confirm the significant impact of the advanced integration of U-Net with SwinUNETR and data pre-processing pipelines on performance and demonstrate the model's effectiveness. The proposed Neuro-TransUNet model, trained with the ATLAS v2.0 training dataset, outperforms existing deep learning algorithms and establishes a new benchmark in stroke lesion segmentation.
- [68] arXiv:2406.06024 [pdf, html, other]
-
Title: Zak-OTFS and Turbo Signal Processing for Joint Sensing and Communication
Authors: Jinu Jayachandran, Muhammad Ubadah, Saif Khan Mohammed, Ronny Hadani, Ananthanarayanan Chockalingam, Robert Calderbank
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. arXiv admin note: text overlap with arXiv:2404.04182
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
The Zak-OTFS input/output (I/O) relation is predictable and non-fading when the delay and Doppler periods are greater than the effective channel delay and Doppler spreads, a condition which we refer to as the crystallization condition. The filter taps can simply be read off from the response to a single Zak-OTFS pilot pulsone, and the I/O relation can be reconstructed for a sampled system that operates under finite duration and bandwidth constraints. In previous work we had measured BER performance of a baseline system where we used separate Zak-OTFS subframes for sensing and data transmission. In this Letter we demonstrate how to use turbo signal processing to match BER performance of this baseline system when we integrate sensing and communication within the same Zak-OTFS subframe. The turbo decoder alternates between channel sensing using a noise-like waveform (spread pulsone) and recovery of data transmitted using point pulsones.
- [69] arXiv:2406.06098 [pdf, html, other]
-
Title: Economic Model Predictive Control of Water Distribution Systems with Accelerated Optimization Algorithm
Subjects: Systems and Control (eess.SY)
Model predictive control (MPC) has emerged as an effective strategy for water distribution systems (WDSs) management. However, it is hampered by the computational burden for large-scale WDSs due to the combinatorial growth of possible control actions that must be evaluated at each time step. Therefore, a fast computation algorithm to implement MPC in WDSs can be obtained using a move-blocking approach that simplifies control decisions while ensuring solution feasibility. This paper introduces a least-restrictive move-blocking that interpolates the blocked control rate of change, aiming at balancing computational efficiency with operational effectiveness. The proposed control strategy is demonstrated on aggregated WDSs, encompassing multiple hydraulic elements. This implementation is incorporated into a multi-objective optimization framework that concurrently optimizes water level security of the storage tanks, smoothness of the control actions, and cost-effective objectives. A fair comparison between the proposed approach and the non-blocking Economic MPC is provided.
- [70] arXiv:2406.06111 [pdf, html, other]
-
Title: JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis
Comments: Accepted to Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
Non-autoregressive GAN-based neural vocoders are widely used due to their fast inference speed and high perceptual quality. However, they often suffer from audible artifacts such as tonal artifacts in their generated results. Therefore, we propose JenGAN, a new training strategy that involves stacking shifted low-pass filters to ensure the shift-equivariant property. This method helps prevent aliasing and reduce artifacts while preserving the model structure used during inference. In our experimental evaluation, JenGAN consistently enhances the performance of vocoder models, yielding significantly superior scores across the majority of evaluation metrics.
- [71] arXiv:2406.06116 [pdf, other]
-
Title: Model Updating for Nonlinear Systems with Stability Guarantees
Comments: arXiv admin note: text overlap with arXiv:2310.20568
Subjects: Systems and Control (eess.SY)
To improve the predictive capacity of system models in the input-output sense, this paper presents a framework for model updating via learning of modeling uncertainties in locally (and thus also in globally) Lipschitz nonlinear systems. First, we introduce a method to extend an existing known model with an uncertainty model so that stability of the extended model is guaranteed in the sense of set invariance and input-to-state stability. To achieve this, we provide two tractable semi-definite programs. These programs allow obtaining optimal uncertainty model parameters for both locally and globally Lipschitz nonlinear models, given uncertainty and state trajectories. Subsequently, in order to extract this data from the available input-output trajectories, we introduce a filter that incorporates an approximated internal model of the uncertainty and asymptotically estimates uncertainty and state realizations. This filter is also synthesized using semi-definite programs with guaranteed robustness with respect to uncertainty model mismatches, disturbances, and noise. Numerical simulations for a large data-set of a roll plane model of a vehicle illustrate the effectiveness and practicality of the proposed methodology in improving model accuracy, while guaranteeing stability.
- [72] arXiv:2406.06151 [pdf, html, other]
-
Title: Computationally Efficient Machine-Learning-Based Online Battery State of Health Estimation
Subjects: Systems and Control (eess.SY)
A key function of battery management systems (BMS) in e-mobility applications is estimating the battery state of health (SoH) with high accuracy. This is typically achieved in commercial BMS using model-based methods. There has been considerable research in developing data-driven methods for improving the accuracy of SoH estimation. The data-driven methods are diverse and use different machine-learning (ML) or artificial intelligence (AI) based techniques. Complex AI/ML techniques are difficult to implement in low-cost microcontrollers used in BMS due to the extensive use of non-linear functions and large matrix operations. This paper proposes a computationally efficient and data-lightweight SoH estimation technique. Online impedance at four discrete frequencies is evaluated to derive the features of a linear regression problem. The proposed solution avoids complex mathematical operations and it is well-suited for online implementation in a commercial BMS. The accuracy of this method is validated on two experimental datasets and is shown to have a mean absolute error (MAE) of less than 2% across diverse training and testing data.
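The linear-regression formulation described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the four features stand in for impedance values at four frequencies, and the weights, offset, and noise level are invented. It is not the authors' feature derivation or dataset.

```python
import numpy as np

# Synthetic stand-in data (assumption): 50 cells, 4 impedance features each.
rng = np.random.default_rng(0)
n_cells = 50
X = rng.uniform(10.0, 30.0, size=(n_cells, 4))   # hypothetical impedance features
true_w = np.array([-0.8, -0.4, -0.2, -0.1])      # invented weights
true_b = 130.0                                    # invented offset
soh = X @ true_w + true_b + rng.normal(0.0, 0.5, n_cells)  # synthetic SoH labels

# Offline step: ordinary least squares with an intercept column.
A = np.hstack([X, np.ones((n_cells, 1))])
coef, *_ = np.linalg.lstsq(A, soh, rcond=None)

def estimate_soh(features):
    """Online step: four multiply-adds plus an offset, no matrix inversion."""
    return float(np.dot(coef[:4], features) + coef[4])

# In-sample mean absolute error on the synthetic data.
mae = float(np.mean(np.abs(A @ coef - soh)))
```

The point of the sketch is that the fitted model reduces to a handful of multiply-add operations at run time, which is the property the abstract highlights for low-cost microcontrollers.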
- [73] arXiv:2406.06157 [pdf, html, other]
-
Title: Model predictive control for tracking using artificial references: Fundamentals, recent results and practical implementation
Authors: Pablo Krupa, Johannes Köhler, Antonio Ferramosca, Ignacio Alvarado, Melanie N. Zeilinger, Teodoro Alamo, Daniel Limon
Comments: (15 pages, 1 figure)
Subjects: Systems and Control (eess.SY)
This paper provides a comprehensive tutorial on a family of Model Predictive Control (MPC) formulations, known as MPC for tracking, which are characterized by including an artificial reference as part of the decision variables in the optimization problem. These formulations have several benefits with respect to the classical MPC formulations, including guaranteed recursive feasibility under online reference changes, as well as asymptotic stability and an increased domain of attraction. This tutorial paper introduces the concept of using an artificial reference in MPC, presenting the benefits and theoretical guarantees obtained by its use. We then provide a survey of the main advances and extensions of the original linear MPC for tracking, including its non-linear extension. Additionally, we discuss its application to learning-based MPC, and discuss optimization aspects related to its implementation.
- [74] arXiv:2406.06160 [pdf, html, other]
-
Title: The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems
Subjects: Audio and Speech Processing (eess.AS)
The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but whether this is also required for speech enhancement is unknown. Moreover, studies that investigated the effect of training dataset size did not control for the data diversity. It is thus unclear whether the performance improvement was due to the increased dataset size or diversity. Therefore, we systematically investigate the effect of training dataset size on the performance of popular state-of-the-art discriminative and diffusion-based speech enhancement systems. We control for the data diversity by using a fixed set of speech utterances, noise segments and binaural room impulse responses to generate datasets of different sizes. We find that the diffusion-based systems do not benefit from increasing the training dataset size as much as the discriminative systems. They perform the best relative to the discriminative systems with datasets of 10 h or less, but they are outperformed by the discriminative systems with datasets of 100 h or more.
- [75] arXiv:2406.06185 [pdf, html, other]
-
Title: EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation
Authors: Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann
Comments: Accepted at Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics. In addition, we conduct a listening test with 20 participants for the speech enhancement task, where a generative method is preferred. We introduce a blind test set that allows for automatic online evaluation of uploaded data. Dataset download links and automatic evaluation server can be found online.
- [76] arXiv:2406.06220 [pdf, html, other]
-
Title: Label-Looping: Highly Efficient Decoding for Transducers
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
This paper introduces a highly efficient greedy decoding algorithm for Transducer inference. We propose a novel data structure using CUDA tensors to represent partial hypotheses in a batch that supports parallelized hypothesis manipulations. During decoding, our algorithm maximizes GPU parallelism by adopting a nested-loop design, where the inner loop consumes all blank predictions, while non-blank predictions are handled in the outer loop. Our algorithm is general-purpose and can work with both conventional Transducers and Token-and-Duration Transducers. Experiments show that the label-looping algorithm can bring a speedup up to 2.0X compared to conventional batched decoding algorithms when using batch size 32, and can be combined with other compiler or GPU call-related techniques to bring more speedup. We will open-source our implementation to benefit the research community.
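The nested-loop structure described in this abstract can be sketched for a single utterance. This is a simplified illustration under stated assumptions, not the authors' batched CUDA implementation: `joint_argmax`, `max_symbols_per_frame`, and the toy joint function are hypothetical names, and the predictor state is reduced to the last emitted label.

```python
def greedy_label_looping(frames, joint_argmax, blank, max_symbols_per_frame=5):
    """Greedy Transducer decoding where the outer loop iterates over labels.

    frames: sequence of encoder outputs (any objects).
    joint_argmax(frame, last_label): hypothetical callable returning the
        most likely token id for this frame and prediction-network state.
    """
    hyp, last, t, emitted_here = [], None, 0, 0
    while t < len(frames):                      # outer loop: one non-blank label
        label = joint_argmax(frames[t], last)
        # Inner loop: consume blanks by advancing time. The cap on symbols
        # per frame also forces a time step so the loop always terminates.
        while label == blank or emitted_here >= max_symbols_per_frame:
            t += 1
            emitted_here = 0
            if t >= len(frames):
                return hyp
            label = joint_argmax(frames[t], last)
        hyp.append(label)                       # the predictor state would be updated here
        last = label
        emitted_here += 1
    return hyp


# Toy usage: a frame "wants" to emit its own value unless it was just emitted.
BLANK = 0

def toy_joint(frame, last):
    return BLANK if frame == last else frame

decoded = greedy_label_looping([0, 7, 0, 0, 9, 9, 0], toy_joint, BLANK)
```

In this arrangement the prediction network only needs to be advanced once per emitted label, while runs of blank frames are skipped in the inner loop.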
- [77] arXiv:2406.06245 [pdf, html, other]
-
Title: A Lora-Based and Maintenance-Free Cattle Monitoring System for Alpine Pastures and Remote Locations
Subjects: Systems and Control (eess.SY); Signal Processing (eess.SP)
The advent of the Internet of Things (IoT) is boosting the proliferation of sensors and smart devices in industry and daily life. Continuous monitoring IoT systems are also finding applications in agriculture, particularly in the realm of smart farming. The adoption of wearable sensors to record the activity of livestock has garnered increasing interest. Such a device enables farmers to locate, monitor, and constantly assess the health status of their cattle more efficiently and effectively, even in challenging terrain and remote locations. This work presents a maintenance-free and robust smart sensing system that is capable of tracking cattle in remote locations and collecting activity parameters, such as the individual's grazing and resting time. To support the paradigm of smart farming, the cattle tracker is capable of monitoring the cow's activity by analyzing data from an accelerometer, magnetometer, temperature sensor, and Global Navigation Satellite System (GNSS) module, providing the data over Long Range Wide Area Network (LoRaWAN) to a backend server. By consuming 511.9 J per day with all subsystems enabled and a data transmission every 15 minutes, the custom-designed sensor node achieves a battery lifetime of 4 months. When exploiting the integrated solar energy harvesting subsystem, this can be increased further, by 40%, to up to 6 months. The final sensing system's robust operation is proven in a trial run with two cows on a pasture for over three days. Evaluations of the experimental results clearly show behavior patterns, which confirms the practicability of the proposed solution.
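The energy figures quoted in this abstract can be put into more familiar units with a short back-of-the-envelope calculation. The 30-day month is an assumption, and the implied battery energy is a derived estimate, not a value stated by the authors.

```python
JOULES_PER_DAY = 511.9      # from the abstract
LIFETIME_MONTHS = 4         # from the abstract
SOLAR_GAIN = 0.40           # from the abstract
DAYS_PER_MONTH = 30         # assumption

SECONDS_PER_DAY = 24 * 60 * 60

# Average power draw in milliwatts.
avg_power_mw = JOULES_PER_DAY / SECONDS_PER_DAY * 1000

# Daily energy in milliwatt-hours (1 mWh = 3.6 J).
energy_per_day_mwh = JOULES_PER_DAY / 3.6

# Usable battery energy implied by the stated lifetime, in watt-hours.
implied_battery_wh = JOULES_PER_DAY * LIFETIME_MONTHS * DAYS_PER_MONTH / 3600

# Lifetime with the stated 40% solar extension, in months.
solar_lifetime_months = LIFETIME_MONTHS * (1 + SOLAR_GAIN)
```

Under these assumptions the node draws about 5.9 mW on average, the stated lifetime implies roughly 17 Wh of usable battery energy, and a 40% extension of 4 months gives 5.6 months, consistent with the abstract's "up to 6 months".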
- [78] arXiv:2406.06247 [pdf, other]
-
Title: Image Compression with Isotropic and Anisotropic Shepard Inpainting
Comments: 37 pages, 8 figures
Subjects: Image and Video Processing (eess.IV)
Inpainting-based codecs store sparse selected pixel data and decode by reconstructing the discarded image parts by inpainting. Successful codecs (coders and decoders) traditionally use inpainting operators that solve partial differential equations. This requires some numerical expertise if efficient implementations are necessary. Our goal is to investigate variants of Shepard inpainting as simple alternatives for inpainting-based compression. They can be implemented efficiently when we localise their weighting function. To turn them into viable codecs, we have to introduce novel extensions of classical Shepard interpolation that adapt successful ideas from previous codecs: Anisotropy allows direction-dependent inpainting, which improves reconstruction quality. Additionally, we incorporate data selection by subdivision as an efficient way to tailor the stored information to the image structure. On the encoding side, we introduce the novel concept of joint inpainting and prediction for isotropic Shepard codecs, where storage cost can be reduced based on intermediate inpainting results. In an ablation study, we show the usefulness of these individual contributions and demonstrate that they offer synergies which elevate the performance of Shepard inpainting to surprising levels. Our resulting approaches offer a more favourable trade-off between simplicity and quality than traditional inpainting-based codecs. Experiments show that they can outperform JPEG and JPEG2000 at high compression ratios.
- [79] arXiv:2406.06251 [pdf, html, other]
-
Title: Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
Comments: Accepted by InterSpeech 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained ones, we explore various efficient fine-tuning approaches. Our experiment shows that the LoRA with bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups.
- [80] arXiv:2406.06252 [pdf, html, other]
-
Title: Random Time-hopping Secure Ranging Strategy Against Distance-Reduction Attacks in UWB
Subjects: Signal Processing (eess.SP); Cryptography and Security (cs.CR)
In order to mitigate the distance reduction attack in Ultra-Wide Band (UWB) ranging, this paper proposes a secure ranging scheme based on a random time-hopping mechanism without redundant signaling overhead. Additionally, a secure ranging strategy is designed for backward compatibility with existing standards such as IEEE 802.15.4a/z, combined with an attack detection scheme. The effectiveness and feasibility of the proposed strategy are demonstrated through both simulation and experimental results in the case of the Ghost Peak attack, as demonstrated by Patrick Leu et al. The random time-hopping mechanism is verified to be capable of reducing the success rate of distance reduction attacks to less than 0.01%, thereby significantly enhancing the security of UWB ranging.
- [81] arXiv:2406.06253 [pdf, html, other]
-
Title: PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency
Authors: Shaokai Lin, Erling Jellum, Mirco Theile, Tassilo Tanneberger, Binqi Sun, Chadlia Jerad, Ruomu Xu, Guangyu Feng, Christian Menard, Marten Lohstroh, Jeronimo Castrillon, Sanjit Seshia, Edward Lee
Subjects: Systems and Control (eess.SY); Programming Languages (cs.PL)
This paper introduces the Precision-Timed Virtual Machine (PretVM), an intermediate platform facilitating the execution of quasi-static schedules compiled from a subset of programs written in the Lingua Franca (LF) coordination language. The subset consists of those programs that in principle should have statically verifiable and predictable timing behavior. The PretVM provides a schedule with well-defined worst-case timing bounds. The PretVM provides a clean separation between application logic and coordination logic, yielding more analyzable program executions. Experiments compare the PretVM against the default (more dynamic) LF scheduler and show that it delivers time-accurate deterministic execution.
- [82] arXiv:2406.06293 [pdf, html, other]
-
Title: Sample Rate Independent Recurrent Neural Networks for Audio Effects Processing
Comments: Accepted for publication in Proc. DAFx24, Guildford, UK, September 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
In recent years, machine learning approaches to modelling guitar amplifiers and effects pedals have been widely investigated and have become standard practice in some consumer products. In particular, recurrent neural networks (RNNs) are a popular choice for modelling non-linear devices such as vacuum tube amplifiers and distortion circuitry. One limitation of such models is that they are trained on audio at a specific sample rate and therefore give unreliable results when operating at another rate. Here, we investigate several methods of modifying RNN structures to make them approximately sample rate independent, with a focus on oversampling. In the case of integer oversampling, we demonstrate that a previously proposed delay-based approach provides high fidelity sample rate conversion whilst additionally reducing aliasing. For non-integer sample rate adjustment, we propose two novel methods and show that one of these, based on cubic Lagrange interpolation of a delay-line, provides a significant improvement over existing methods. To our knowledge, this work provides the first in-depth study into this problem.
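The building block named in this abstract, cubic Lagrange interpolation of a delay line, can be sketched with the standard Lagrange fractional-delay formula. This shows only the generic interpolator; how the authors embed it in the RNN state feedback is not reproduced here, and the function names are hypothetical.

```python
def lagrange_fractional_delay(delay, order=3):
    """Coefficients h[0..order] approximating a delay of `delay` samples.

    Standard Lagrange design: h[k] = prod_{m != k} (delay - m) / (k - m).
    For order 3 the approximation is most accurate for delays near 1.5.
    """
    coeffs = []
    for k in range(order + 1):
        c = 1.0
        for m in range(order + 1):
            if m != k:
                c *= (delay - m) / (k - m)
        coeffs.append(c)
    return coeffs


def read_delay_line(history, delay):
    """Interpolated read from a delay line.

    history[k] holds x[n - k] (history[0] is the newest sample); the
    return value approximates x[n - delay] for a non-integer delay.
    """
    h = lagrange_fractional_delay(delay, order=3)
    return sum(hk * history[k] for k, hk in enumerate(h))


# Example: a cubic signal x[n] = n**3 sampled at n = 10, 9, 8, 7.
value = read_delay_line([1000.0, 729.0, 512.0, 343.0], delay=1.5)
```

Because a cubic interpolator reproduces cubic polynomials exactly, the example recovers the signal value at n = 8.5; for real audio the result is an approximation whose error grows toward high frequencies.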
- [83] arXiv:2406.06297 [pdf, html, other]
-
Title: Learning-based cognitive architecture for enhancing coordination in human groups
Subjects: Systems and Control (eess.SY)
As interactions with autonomous agents, ranging from robots in physical settings to avatars in virtual and augmented realities, become more prevalent, developing advanced cognitive architectures is critical for enhancing the dynamics of human-avatar groups. This paper presents a reinforcement-learning-based cognitive architecture, trained via a sim-to-real approach, designed to improve synchronization in periodic motor tasks, crucial for applications in group rehabilitation and sports training. Extensive numerical validation consistently demonstrates improvements in synchronization. Theoretical derivations and numerical investigations are complemented by preliminary experiments with real participants, showing that our avatars can integrate seamlessly into human groups, often being indistinguishable from humans.
- [84] arXiv:2406.06306 [pdf, html, other]
-
Title: Unified Fourier bases for GSP on stochastic block model graphs
Comments: 23 pages
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Statistics Theory (math.ST)
We consider a recently proposed approach to graph signal processing based on graphons. We show how the graphon-based approach to GSP applies to graphs sampled from a stochastic block model. We obtain a basis for the graphon Fourier transform on such samples directly from the link probability matrix and the block sizes of the model. This formulation allows us to bound the sensitivity of the Fourier transform to small changes in block sizes. We then focus on the case where the probability matrix corresponds to a (weighted) Cayley graph. If block sizes are equal, a nice Fourier basis can be derived from the underlying group. We explore how, in the case where block sizes are not equal, some or all nice properties of the group basis can be maintained. We complement the theoretical results with simulations.
- [85] arXiv:2406.06343 [pdf, html, other]
-
Title: Thin Film Reconfigurable Intelligent Surface for Harmonic Beam Steering
Comments: 5 pages, 4 figures, letter
Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
This letter explores the development and implementation of a novel thin film 1-by-4 reconfigurable intelligent surface (RIS) designed for future communication and sensing scenarios. Utilizing cost-effective inkjet printing methods and additive manufacturing, our approach significantly simplifies the RIS construction process and reduces production costs. The RIS, fabricated on a flexible and lightweight polyethylene terephthalate (PET) substrate, integrates antennas, switching circuitry, and a microcontroller unit (MCU). This setup enables individual and simultaneous control of each RIS element, manipulating the captured carrier signal by steering its dominant harmonics toward multiple desired directions. Measurement results of the beam steering show the manufactured RIS has the potential to enable RIS-aided communication and sensing applications.
- [86] arXiv:2406.06381 [pdf, html, other]
-
Title: Feature Characterization for Profile Surface Texture
Subjects: Signal Processing (eess.SP)
Conventional field parameters for surface measurement use all data points, while feature characterization focuses on subsets extracted by watershed segmentation. This approach enables the extraction of specific features that are potentially responsible for the function of the surface or are a direct reflection of the manufacturing process, allowing for a more accurate assessment of both aspects. Feature characterization with the underlying watershed segmentation for areal surface topographies has been standardized for over a decade and is well established in industry and research. In contrast, feature characterization for surface profiles has been standardized recently, and the corresponding standard for watershed segmentation is planned to be published in the near future. Since the standards do not provide guidelines for implementation, this paper presents an unambiguous algorithm of the watershed segmentation and the feature characterization for surface profiles. This framework provides the basis for future work, mainly investigating the relationship between feature parameters based on feature characterization and the function of the surface or manufacturing process. For this purpose, recommendations for the configuration and extensions of the toolbox can also be developed, which could find their way into the ISO standards.
- [87] arXiv:2406.06392 [pdf, html, other]
-
Title: Tackling Delayed CSI in a Distributed Multi-Satellite MIMO Communication System
Subjects: Signal Processing (eess.SP)
In this study, we explore the integration of satellites with ground-based communication networks. Specifically, we analyze downlink data transmission from a constellation of satellites to terrestrial users and address the issue of delayed channel state information (CSI). The satellites cooperate in data transmission within a cluster to create a unified, distributed massive multiple input, multiple output (MIMO) system. The CSI used for this process is inherently outdated, particularly due to the delay from the most distant satellite in the cluster. Therefore, in this paper, we develop a precoding strategy that leverages the long-term characteristics of CSI uncertainty to compensate for the undesirable impact of these unavoidable delays. Our proposed method is computationally efficient and particularly effective in lower frequency bands. As such, it holds significant promise for facilitating the integration of satellite and terrestrial communication, especially within frequency bands of up to 1 GHz.
- [88] arXiv:2406.06402 [pdf, html, other]
-
Title: Early Acceptance Matching Game for User-Centric Clustering in Scalable Cell-free MIMO Networks
Comments: This work has been accepted for publication in 2024 European Conference on Networks and Communications (EuCNC) & 6G Summit
Subjects: Signal Processing (eess.SP); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
The canonical setup is the primary approach adopted in cell-free multiple-input multiple-output (MIMO) networks, in which all access points (APs) jointly serve every user equipment (UE). This approach is not scalable in terms of computational complexity and fronthaul signaling, becoming impractical in large networks. This work adopts a user-centric approach, a scalable alternative in which only a set of preferred APs jointly serve a UE. Forming the optimal cluster of APs for each UE is a challenging task, especially when it needs to be dynamically adjusted to meet the quality of service (QoS) requirements of the UE. This complexity is further exacerbated when considering the constrained fronthaul capacity of the UE and the AP. We solve this problem with a novel many-to-many matching game. More specifically, we devise an early acceptance matching algorithm, which immediately admits or rejects UEs based on their requests and available radio resources. Compared to state-of-the-art approaches, the proposed solution significantly reduces the fronthaul signaling while maximizing the number of UEs whose requested QoS is satisfied.
- [89] arXiv:2406.06404 [pdf, html, other]
-
Title: A LoRa-based Energy-efficient Sensing System for Urban Data Collection
Subjects: Systems and Control (eess.SY)
Nowadays, cities provide much more than shopping opportunities or working spaces. Individual locations such as parks and squares are used as meeting points and local recreation areas by many people. To ensure that they remain attractive in the future, the design of such squares must be regularly adapted to the needs of the public. These utilization trends can be derived using public data collection. The more diverse and rich the data sets are, the easier it is to optimize public space design through data analysis. Traditional data collection methods such as questionnaires, observations, or videos are either labor intensive or cannot guarantee the individual's privacy. This work presents a privacy-preserving, low-power, and low-cost smart sensing system that is capable of anonymously collecting data about public space utilization by analyzing the occupancy distribution of public seating. To support future urban planning, the sensor nodes are capable of monitoring environmental noise, chair utilization, and their position, temperature, and humidity, and provide the data over a city-wide Long Range Wide Area Network (LoRaWAN). The final sensing system's robust operation is proven in a trial run at two public squares in a city with 16 sensor nodes over a duration of two months. By consuming 33.65 mWh per day with all subsystems enabled, including sitting detection based on a continuous acceleration measurement operating on a robust and simple threshold algorithm, the custom-designed sensor node achieves continuous monitoring during the 2-month trial run. The evaluation of the experimental results clearly shows how the two locations are used, which confirms the practicability of the proposed solution. All data collected during the field trial is publicly available as open data.
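A threshold rule on continuous acceleration samples, of the general kind mentioned in this abstract, might look like the sketch below. The abstract does not give the actual algorithm, so the deviation-from-1-g criterion, the threshold value, the refractory window, and the function name are all hypothetical choices made for illustration.

```python
def detect_motion_events(samples_g, threshold_g=0.2, refractory=5):
    """Return sample indices where acceleration deviates from rest.

    samples_g: acceleration magnitudes in units of g (about 1.0 at rest).
    threshold_g: hypothetical deviation from 1 g that counts as an event.
    refractory: hypothetical number of samples ignored after an event,
        so one sit-down or stand-up movement is not counted repeatedly.
    """
    events = []
    cooldown = 0
    for i, a in enumerate(samples_g):
        if cooldown > 0:
            cooldown -= 1          # still inside the refractory window
            continue
        if abs(a - 1.0) > threshold_g:
            events.append(i)
            cooldown = refractory
    return events


# Toy trace: rest, a burst of movement, rest, one more movement, rest.
trace = [1.0] * 10 + [1.5, 1.4, 0.7] + [1.0] * 10 + [0.6] + [1.0] * 5
events = detect_motion_events(trace)
```

A rule of this shape needs only a comparison and a counter per sample, which is in keeping with the low daily energy budget the abstract reports.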
- [90] arXiv:2406.06434 [pdf, html, other]
-
Title: Spatiotemporal Graph Neural Network Modelling Perfusion MRI
Comments: 11 pages, 2 figures
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Perfusion MRI (pMRI) offers valuable insights into tumor vascularity and promises to predict tumor genotypes, thus benefiting prognosis for glioma patients, yet effective models tailored to 4D pMRI are still lacking. This study presents the first attempt to model 4D pMRI using a GNN-based spatiotemporal model PerfGAT, integrating spatial information and temporal kinetics to predict Isocitrate DeHydrogenase (IDH) mutation status in glioma patients. Specifically, we propose a graph structure learning approach based on edge attention and negative graphs to optimize temporal correlations modeling. Moreover, we design a dual-attention feature fusion module to integrate spatiotemporal features while addressing tumor-related brain regions. Further, we develop a class-balanced augmentation method tailored to spatiotemporal data, which could mitigate the common label imbalance issue in clinical datasets. Our experimental results demonstrate that the proposed method outperforms other state-of-the-art approaches, promising to model pMRI effectively for patient characterization.
New submissions for Tuesday, 11 June 2024 (showing 90 of 90 entries)
- [91] arXiv:2406.05137 (cross-list from cs.CR) [pdf, other]
-
Title: Multi-sensor Intrusion Detection System
Subjects: Cryptography and Security (cs.CR); Systems and Control (eess.SY)
Security, defined as protection against external threats, is a critical concern for homes and offices. Intrusion, characterized by unauthorized access, presents a significant challenge to maintaining security. This research aims to address this issue by designing and implementing an automated intrusion detection system utilizing a combination of sensors and communication technologies.
The research introduced an automated intrusion detection system for homes and offices, combining sensors such as a PIR sensor for detecting unauthorized motion, magnetic switches for unauthorized entry detection, and a GSM module for notifying property owners. Employing the ATmega328P microcontroller, sensor data is analysed to generate early intrusion alerts, prompting phone call notifications via the GSM module. Practical implementation involved breadboarding, soldering, and rigorous testing, ensuring proper functionality under real-world conditions. The implemented intrusion detection system effectively utilizes magnetic switches and a Passive Infrared (PIR) sensor to detect unauthorized entry and motion within the premises, respectively. Upon detection, the system promptly analyses the situation and alerts the property owner via phone call, enabling swift response measures. This real-time notification system enhances proactive security management, minimizing the risk of further intrusion and ensuring the safety of the property. The multi-sensor intrusion detection system, incorporating PIR sensors, magnetic switches, and a GSM-based phone call gateway, effectively alerts property owners of unauthorized intrusions in real-time. Demonstrating its efficacy through rigorous testing, the system offers enhanced security for both residential and commercial environments.
- [92] arXiv:2406.05152 (cross-list from cs.CV) [pdf, html, other]
-
Title: Fight Scene Detection for Movie Highlight Generation System
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
In this paper, we present a novel Fight Scene Detection (FSD) model based on Bidirectional Long Short-Term Memory (BiLSTM) networks, which can be used in deep-learning-based Movie Highlight Generation Systems (MHGS). Movies usually include fight scenes to keep the audience engaged. For trailer generation, or any other highlight-generation application, it is very tedious to first identify all such scenes manually and then compile them into a highlight. Our proposed FSD system exploits the temporal characteristics of movie scenes and can therefore identify fight scenes automatically, helping in the effective production of captivating movie highlights. We observe that the proposed solution achieves 93.5% accuracy, which is higher than the 92% of a 2D CNN with Hough Forests and significantly higher than the 65% of a 3D CNN.
- [93] arXiv:2406.05170 (cross-list from q-bio.OT) [pdf, other]
-
Title: Research on Tumors Segmentation based on Image Enhancement Method
Subjects: Other Quantitative Biology (q-bio.OT); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
One of the most effective ways to treat liver cancer is to perform precise liver resection surgery, the key step of which includes precise digital image segmentation of the liver and its tumor. However, traditional liver parenchymal segmentation techniques often face several challenges: lack of precision, slow processing speed, and computational burden. These shortcomings limit the efficiency of surgical planning and execution. In this work, we first describe in detail a new image enhancement algorithm that enhances the key features of an image by adaptively adjusting its contrast and brightness. Then, a deep learning-based segmentation network is introduced, trained specifically on the enhanced images to optimize the detection accuracy of tumor regions. In addition, multi-scale analysis techniques are incorporated, allowing the model to analyze images at different resolutions to capture more nuanced tumor features. The proposed method is tested on the 3Dircadb dataset. The experimental results show that, compared with the traditional image segmentation method, the new method using image enhancement significantly improves the accuracy and recall of tumor identification.
- [94] arXiv:2406.05205 (cross-list from cs.CV) [pdf, html, other]
-
Title: CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment
Sajid Javed, Arif Mahmood, Iyyakutti Iyappan Ganapathi, Fayaz Ali Dharejo, Naoufel Werghi, Mohammed Bennamoun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
This paper proposes Comprehensive Pathology Language Image Pre-training (CPLIP), a new unsupervised technique designed to enhance the alignment of images and text in histopathology for tasks such as classification and segmentation. This methodology enriches vision-language models by leveraging extensive data without needing ground truth annotations. CPLIP involves constructing a pathology-specific dictionary, generating textual descriptions for images using language models, and retrieving relevant images for each text snippet via a pre-trained model. The model is then fine-tuned using a many-to-many contrastive learning method to align complex interrelated concepts across both modalities. Evaluated across multiple histopathology tasks, CPLIP shows notable improvements in zero-shot learning scenarios, outperforming existing methods in both interpretability and robustness and setting a higher benchmark for the application of vision-language models in the field. To encourage further research and replication, the code for CPLIP is available on GitHub at this https URL
- [95] arXiv:2406.05270 (cross-list from physics.med-ph) [pdf, other]
-
Title: fastMRI Breast: A publicly available radial k-space dataset of breast dynamic contrast-enhanced MRI
Eddy Solomon, Patricia M. Johnson, Zhengguo Tan, Radhika Tibrewala, Yvonne W. Lui, Florian Knoll, Linda Moy, Sungheon Gene Kim, Laura Heacock
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
This data curation work introduces the first large-scale dataset of radial k-space and DICOM data for breast DCE-MRI acquired in diagnostic breast MRI exams. Our dataset includes case-level labels indicating patient age, menopause status, lesion status (negative, benign, and malignant), and lesion type for each case. The public availability of this dataset and accompanying reconstruction code will support research and development of fast and quantitative breast image reconstruction and machine learning methods.
- [96] arXiv:2406.05305 (cross-list from cs.CV) [pdf, other]
-
Title: YouTube SFV+HDR Quality Dataset
Comments: Accepted by 2024 IEEE International Conference on Image Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
The popularity of short-form videos (SFV) has grown dramatically in the past few years, and SFV has become a phenomenal video category with billions of viewers. Meanwhile, High Dynamic Range (HDR), as an advanced feature, is also becoming more and more popular on video sharing platforms. As a hot topic with huge impact, SFV and HDR bring new questions to video quality research: 1) is SFV+HDR quality assessment significantly different from traditional User Generated Content (UGC) quality assessment? 2) do objective quality metrics designed for traditional UGC still work well for SFV+HDR? To answer these questions, we created the first large-scale SFV+HDR dataset with reliable subjective quality scores, covering 10 popular content categories. Further, we introduce a general sampling framework to maximize the representativeness of the dataset. We provide a comprehensive analysis of subjective quality scores for short-form SDR and HDR videos, and discuss the reliability of state-of-the-art UGC quality metrics and potential improvements.
- [97] arXiv:2406.05370 (cross-list from cs.CL) [pdf, html, other]
-
Title: VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted to this https URL.
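One plausible reading of Repetition Aware Sampling, sketched under stated assumptions: draw a token with nucleus sampling, measure how often that token occurs in a recent window of the decoding history, and fall back to sampling from the full distribution when the repetition ratio is too high. The function names, the window size, and the thresholds below are hypothetical; the abstract does not give these details.

```python
import random

def nucleus_sample(probs, top_p, rng):
    """Sample a token id from the smallest token set whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            max_ratio=0.1, rng=random):
    """Nucleus sampling with a fallback when the token repeats too often.

    window and max_ratio are illustrative values, not taken from the paper.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Assumed fallback: sample from the full, untruncated distribution.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

With a sharply peaked distribution, plain nucleus sampling keeps returning the dominant token, whereas the fallback lets low-probability tokens through once the history is saturated, which is one way a decoding loop could be broken.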
- [98] arXiv:2406.05395 (cross-list from cs.LG) [pdf, other]
-
Title: Dynamic importance learning using fisher information gain for nonlinear system identification
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
The Fisher Information Matrix (FIM) provides a way to quantify the information content of an observable random variable concerning unknown parameters within a model that characterizes the variable. When parameters in a model are directly linked to individual features, the diagonal elements of the FIM can signify the relative importance of each feature. However, in scenarios where feature interactions may exist, a comprehensive exploration of the full FIM is necessary rather than focusing solely on its diagonal elements. This paper presents an end-to-end black box system identification approach that integrates the FIM into the training process to gain insights into dynamic importance and overall model structure. A decision module is added to the first layer of the network to determine the relevance scores using the entire FIM as input. The forward propagation is then performed on element-wise multiplication of inputs and relevance scores. Simulation results demonstrate that the proposed methodology effectively captures various types of interactions between dynamics, outperforming existing methods limited to polynomial interactions. Moreover, the effectiveness of this novel approach is confirmed through its application in identifying a real-world industrial system, specifically the pH neutralization process.
- [99] arXiv:2406.05441 (cross-list from math.PR) [pdf, html, other]
-
Title: Two identities for Poisson Point Processes and Voronoi Tessellations with Applications
Subjects: Probability (math.PR); Signal Processing (eess.SP)
In this paper, we introduce two identities: one pertaining to the state space of Poisson Point Processes (PPPs), and the other to the Voronoi tessellations formed by PPPs. Then, we explore several applications of these identities within the context of wireless cellular networks.
- [100] arXiv:2406.05464 (cross-list from cs.SD) [pdf, html, other]
-
Title: DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models
Comments: Accepted by Interspeech 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Self-supervised speech models have been shown to be useful for various tasks, but their large size limits their use in devices with low computing power and memory. In this work, we explore early exit, an approach for reducing latency by exiting the forward process of a network early. Most early exit approaches need a separate early exit model for each task, with some even requiring fine-tuning of the entire pretrained model. We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss, eliminating the need for multiple rounds of training and fine-tuning. DAISY matches the performance of HuBERT on the MiniSUPERB benchmark, but with much faster inference times. Our analysis of the adaptivity of DAISY shows that the model exits early (using fewer layers) on clean data and exits late (using more layers) on noisy data, dynamically adjusting the computational cost of inference based on the noise level of each sample.
- [101] arXiv:2406.05472 (cross-list from cs.CR) [pdf, html, other]
-
Title: A Novel Generative AI-Based Framework for Anomaly Detection in Multicast Messages in Smart Grid Communications
Comments: 10 pages, 10 figures, Submitted to IEEE Transactions on Information Forensics and Security
Subjects: Cryptography and Security (cs.CR); Systems and Control (eess.SY)
Cybersecurity breaches in digital substations can pose significant challenges to the stability and reliability of power system operations. To address these challenges, defense and mitigation techniques are required. Identifying and detecting anomalies in information and communication technology (ICT) is crucial to ensure secure device interactions within digital substations. This paper proposes a task-oriented dialogue (ToD) system for anomaly detection (AD) in datasets of multicast messages, e.g., generic object oriented substation event (GOOSE) and sampled value (SV), in digital substations using large language models (LLMs). This model has a lower potential error and better scalability and adaptability than a process that considers the cybersecurity guidelines recommended by humans, known as the human-in-the-loop (HITL) process. Also, this methodology significantly reduces the effort required when addressing new cyber threats or anomalies compared with machine learning (ML) techniques, since it leaves the model's complexity and precision unaffected and offers a faster implementation. These findings present a comparative assessment, conducted utilizing standard and advanced performance evaluation metrics for the proposed AD framework and the HITL process. To generate and extract datasets of IEC 61850 communications, a hardware-in-the-loop (HIL) testbed was employed.
- [102] arXiv:2406.05475 (cross-list from cs.CV) [pdf, html, other]
-
Title: HDRT: Infrared Capture for HDR Imaging
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Capturing real-world lighting is a long-standing challenge in imaging, and most practical methods acquire High Dynamic Range (HDR) images by either fusing multiple exposures or boosting the dynamic range of Standard Dynamic Range (SDR) images. Multiple exposure capture is problematic as it requires longer capture times, which can often lead to ghosting problems. The main alternative, inverse tone mapping, is an ill-defined problem that is especially challenging as single captured exposures usually contain clipped and quantized values, and are therefore missing substantial amounts of content. To alleviate this, we propose a new approach, High Dynamic Range Thermal (HDRT), for HDR acquisition using a separate, commonly available, thermal infrared (IR) sensor. We propose a novel deep neural method (HDRTNet) which combines IR and SDR content to generate HDR images. HDRTNet learns to exploit IR features linked to the RGB image, and the IR-specific parameters are subsequently used in a dual branch method that fuses features at shallow layers. This produces an HDR image that is significantly superior to that generated using naive fusion approaches. To validate our method, we have created the first HDR and thermal dataset, and performed extensive experiments comparing HDRTNet with the state-of-the-art. We show substantial quantitative and qualitative quality improvements on both over- and under-exposed images, showing that our approach is robust to capture under different lighting conditions.
- [103] arXiv:2406.05512 (cross-list from math.CO) [pdf, html, other]
-
Title: Optimal k-centers of a graph: a control-theoretic approach
Subjects: Combinatorics (math.CO); Systems and Control (eess.SY)
In a network consisting of n nodes, our goal is to identify the most central k nodes with respect to the proposed definitions of centrality. Depending on the specific application, there exist several metrics for quantifying k-centrality, and the subset of the best k nodes naturally varies based on the chosen metric. In this paper, we propose two metrics and establish connections to a well-studied metric from the literature (specifically for stochastic matrices). We prove these three notions match for path graphs. We then list a few more control-theoretic notions and compare these various notions for a general randomly generated graph. Our first metric involves maximizing the shift in the smallest eigenvalue of the Laplacian matrix. This shift can be interpreted as an improvement in the time constant when the RC circuit experiences leakage at certain k capacitors. The second metric focuses on minimizing the Perron root of a principal sub-matrix of a stochastic matrix, an idea proposed and interpreted in the literature as manufacturing consent. The third one explores minimizing the Perron root of a perturbed (now super-stochastic) matrix, which can be seen as minimizing the impact of added stubbornness. It is important to emphasize that we consider applications (for example, facility location) when the notions of central ports are such that the set of the best k ports does not necessarily contain the set of the best k-1 ports. We apply our k-port selection metric to various network structures. Notably, we prove the equivalence of three definitions for a path graph and extend the concept of central port linkage beyond Fiedler vectors to other eigenvectors associated with path graphs.
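The first metric in this abstract (maximizing the shift in the smallest Laplacian eigenvalue when leakage is added at k nodes) can be illustrated with a brute-force search on a small path graph. This is only a sketch under assumptions: the unit leakage value, the exhaustive search, and the helper names `path_laplacian` and `best_k_nodes` are illustrative choices, not the paper's method.

```python
from itertools import combinations

import numpy as np

def path_laplacian(n):
    """Laplacian matrix of the path graph on n nodes."""
    adj = np.zeros((n, n))
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    return np.diag(adj.sum(axis=1)) - adj

def best_k_nodes(laplacian, k, leak=1.0):
    """Exhaustively find the k nodes whose leakage maximizes the smallest eigenvalue.

    leak is a hypothetical leakage conductance added to the chosen diagonal entries.
    """
    n = laplacian.shape[0]
    best_set, best_val = None, -np.inf
    for subset in combinations(range(n), k):
        perturbed = laplacian.copy()
        for i in subset:
            perturbed[i, i] += leak
        val = np.linalg.eigvalsh(perturbed)[0]  # eigvalsh returns ascending eigenvalues
        if val > best_val + 1e-12:
            best_set, best_val = subset, val
    return best_set, best_val

nodes, shift = best_k_nodes(path_laplacian(5), k=1)
print(nodes, shift)
```

Because the unperturbed Laplacian has smallest eigenvalue 0, the returned value is itself the shift. For the 5-node path with k = 1, the search selects the middle node, which matches the intuition that a single leak is most effective at the center of a path.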
- [104] arXiv:2406.05515 (cross-list from cs.SD) [pdf, html, other]
-
Title: Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation
Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim
Comments: Accepted to INTERSPEECH 2024
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Acoustic context effects, where surrounding changes in pitch, rate or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second language (L2) speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is in fact influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2s pre-target and a distal contrastive effect up to 1s before; and found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domain.
- [105] arXiv:2406.05525 (cross-list from cs.ET) [pdf, html, other]
-
Title: Energy-Efficient Approximate Full Adders Applying Memristive Serial IMPLY Logic For Image Processing
Subjects: Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
Researchers and designers are facing problems with memory and power walls, considering the pervasiveness of Von-Neumann architecture in the design of processors and the problems caused by reducing the dimensions of deep sub-micron transistors. Memristive Approximate Computing (AC) and In-Memory Processing (IMP) can be promising solutions to these problems. We have tried to solve power and memory wall problems by presenting the implementation algorithm of four memristive approximate full adders applying the Material Implication (IMPLY) method. The proposed circuits reduce the number of computational steps by up to 40% compared to State-of-the-art (SOA). The energy consumption of the proposed circuits improves over the previous exact ones by 49%-75% and over the approximate full adders by up to 41%. Multiple error evaluation criteria evaluate the computational accuracy of the proposed approximate full adders in three scenarios in the 8-bit approximate adder structure. The proposed approximate full adders are evaluated in three image processing applications in three scenarios. The results of application-level simulation indicate that the four proposed circuits can be applied in all three scenarios, considering the acceptable image quality metrics of the output images (the Peak Signal to Noise Ratio (PSNR) of the output images is greater than 30 dB).
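For readers unfamiliar with material implication, the sketch below shows how an exact full adder can be composed from IMPLY and a constant 0 alone, using the identity NAND(p, q) = p IMPLY (q IMPLY 0). It models only the Boolean logic: it is not the serial memristive algorithm or any of the approximate adders proposed in the paper, and the function names are illustrative.

```python
def imply(p, q):
    """Material implication: 0 only when p is 1 and q is 0."""
    return int((not p) or q)

def nand(p, q):
    # q IMPLY 0 is NOT q, so p IMPLY (NOT q) equals NOT (p AND q).
    return imply(p, imply(q, 0))

def full_adder(a, b, cin):
    """Exact one-bit full adder built from nine NAND gates."""
    n1 = nand(a, b)
    x = nand(nand(a, n1), nand(b, n1))       # a XOR b
    n4 = nand(x, cin)
    s = nand(nand(x, n4), nand(cin, n4))     # (a XOR b) XOR cin
    cout = nand(n1, n4)                      # majority of a, b, cin
    return s, cout

# Print the full truth table.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            print(a, b, cin, full_adder(a, b, cin))
```

An approximate adder would deliberately deviate from this truth table on some input rows in exchange for fewer computational steps, which is the trade-off the abstract quantifies.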
- [106] arXiv:2406.05547 (cross-list from cs.SD) [pdf, html, other]
-
Title: Exploring the Benefits of Tokenization of Discrete Acoustic Units
Comments: Interspeech 2024
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.
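To make the idea of merging base units into larger, variable-rate units concrete, here is a minimal byte-pair-encoding-style sketch over a sequence of integer unit ids. The abstract does not say which tokenization algorithm the authors use, so BPE is an assumption here, and the unit sequence is made up for illustration.

```python
from collections import Counter

def merge_pair(seq, pair, new_symbol):
    """Replace every non-overlapping occurrence of pair in seq with new_symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(seq, num_merges):
    """Greedily merge the most frequent adjacent pair, up to num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging would not shorten the sequence further
        # The merged token is represented by the pair itself (a nested tuple).
        seq = merge_pair(seq, pair, pair)
        merges.append(pair)
    return seq, merges

units = [3, 7, 3, 7, 5, 3, 7, 5]  # hypothetical discrete acoustic unit ids
tokens, merges = learn_merges(units, num_merges=2)
print(len(units), "->", len(tokens), merges)
```

The shorter token sequence is what would translate into fewer modeling steps, consistent with the training and inference speed gains the abstract reports.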
- [107] arXiv:2406.05575 (cross-list from cs.RO) [pdf, html, other]
-
Title: A Survey on Hybrid Motion Planning Methods for Automated Driving Systems
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Motion planning is an essential element of the modular architecture of autonomous vehicles, serving as a bridge between upstream perception modules and downstream low-level control signals. Traditional motion planners were initially designed for specific Automated Driving Functions (ADFs), yet the evolving landscape of highly automated driving systems (ADS) requires motion planning for a wide range of ADFs, including unforeseen ones. This need has motivated the development of the "hybrid" approach in the literature, seeking to enhance motion planning performance by combining diverse techniques, such as data-driven (learning-based) and logic-driven (analytic) methodologies. Recent research endeavours have significantly contributed to the development of more efficient, accurate, and safe hybrid methods for Tactical Decision Making (TDM) and Trajectory Generation (TG), as well as integrating these algorithms into the motion planning module. Owing to the extensive variety and potential of hybrid methods, a timely and comprehensive review of the current literature is undertaken in this survey article. We classify the hybrid motion planners based on the types of components they incorporate, such as combinations of sampling-based with optimization-based/learning-based motion planners. The comparison of different classes is conducted by evaluating the addressed challenges and limitations, as well as assessing whether they focus on TG and/or TDM. We hope this approach will enable researchers in this field to gain in-depth insights into current trends in hybrid motion planning and shed light on promising areas for future research.
- [108] arXiv:2406.05576 (cross-list from cs.IT) [pdf, html, other]
-
Title: Uplink resource allocation optimization for user-centric cell-free MIMO networks
Comments: To appear in IEEE Transactions on Wireless Communications, 16 pages, 9 figures
Subjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP); Systems and Control (eess.SY); Optimization and Control (math.OC)
We examine the problem of optimizing resource allocation in the uplink for a user-centric, cell-free, multi-input multi-output network. We start by modeling and developing resource allocation algorithms for two standard network operation modes. The centralized mode provides high data rates but suffers from multiple issues, including poor scalability. The distributed mode has the opposite problem: it is scalable but offers relatively low rates. To address these challenges, we combine the strengths of the two standard modes, creating a new semi-distributed operation mode. To avoid the need for information exchange between access points, we introduce a new quality of service metric to decentralize the resource allocation algorithms. Our results show that we can eliminate the need for information exchange with a relatively small penalty on data rates.
- [109] arXiv:2406.05629 (cross-list from cs.CV) [pdf, html, other]
-
Title: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Comments: Computer Vision and Pattern Recognition 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech and sound prompted semantic segmentation. DenseAV outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project Page: this https URL
- [110] arXiv:2406.05632 (cross-list from math.OC) [pdf, html, other]
-
Title: Best Response Strategies for Asymmetric Sensing in Linear-Quadratic Differential Games
Comments: Accepted to IEEE L-CSS
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
In this paper, we revisit the two-player continuous-time infinite-horizon linear quadratic differential game problem, where one of the players can sample the state of the system only intermittently due to a sensing constraint while the other player can do so continuously. Under these asymmetric sensing limitations between the players, we analyze the optimal sensing and control strategies for the player at a disadvantage while the other player continues to play its security strategy. We derive an optimal sensor policy within the class of stationary randomized policies. Finally, using simulations, we show that the expected cost accrued by the first player approaches its security level as its sensing limitation is relaxed.
- [111] arXiv:2406.05653 (cross-list from cs.SD) [pdf, html, other]
-
Title: Heart Sound Segmentation Using Deep Learning Techniques
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Heart disease remains a leading cause of mortality worldwide. Auscultation, the process of listening to heart sounds, can be enhanced through computer-aided analysis using Phonocardiogram (PCG) signals. This paper presents a novel approach for heart sound segmentation and classification into S1 (LUB) and S2 (DUB) sounds. We employ FFT-based filtering, dynamic programming for event detection, and a Siamese network for robust classification. Our method demonstrates superior performance on the PASCAL heart sound dataset compared to existing approaches.
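The FFT-based filtering step mentioned in this abstract could look roughly like the band-pass sketch below, which zeroes frequency bins outside a chosen band and transforms back. The cutoff frequencies, the sample rate, and the function name `fft_bandpass` are assumptions for illustration; the abstract does not state the filter settings used on the PCG signals.

```python
import numpy as np

def fft_bandpass(signal, sample_rate, low_hz, high_hz):
    """Keep only frequency components between low_hz and high_hz (inclusive)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(signal))

# Synthetic example: a 50 Hz component plus 400 Hz interference, sampled at 1 kHz.
sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
wanted = np.sin(2 * np.pi * 50 * t)
noisy = wanted + np.sin(2 * np.pi * 400 * t)
filtered = fft_bandpass(noisy, sample_rate, low_hz=20, high_hz=200)
print(np.max(np.abs(filtered - wanted)))
```

A hard spectral mask like this is the simplest option; on real recordings it can introduce ringing near sharp transients, so a tapered mask may be preferable.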
- [112] arXiv:2406.05681 (cross-list from cs.SD) [pdf, html, other]
-
Title: Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
Comments: 5 pages, 2 figures, accepted by Interspeech2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness.
- [113] arXiv:2406.05692 (cross-list from cs.SD) [pdf, html, other]
-
Title: SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion
Comments: Accepted by Interspeech 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality compared to traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, the models tend to generate hoarse audio, posing challenges in achieving high-quality vocal outputs. Therefore, in this paper, we propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC), which can enhance the voice quality in SVC tasks without requiring additional data or increasing model parameters. We introduce a cycle pitch shifting training strategy and a Structural Similarity Index (SSIM) loss into our SVC model, effectively enhancing its performance. Experimental results on the public singing dataset M4Singer indicate that our proposed method significantly improves model performance in general SVC scenarios and particularly in cross-domain SVC scenarios.
- [114] arXiv:2406.05700 (cross-list from cs.CV) [pdf, html, other]
-
Title: HDMba: Hyperspectral Remote Sensing Imagery Dehazing with State Space Model
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Haze contamination in hyperspectral remote sensing images (HSI) can lead to spatial visibility degradation and spectral distortion. Haze in HSI exhibits spatial irregularity and inhomogeneous spectral distribution, with few dehazing networks available. Current CNN and Transformer-based dehazing methods fail to balance global scene recovery, local detail retention, and computational efficiency. Inspired by the ability of Mamba to model long-range dependencies with linear complexity, we explore its potential for HSI dehazing and propose the first HSI Dehazing Mamba (HDMba) network. Specifically, we design a novel window selective scan module (WSSM) that captures local dependencies within windows and global correlations between windows by partitioning them. This approach improves the ability of conventional Mamba in local feature extraction. By modeling the local and global spectral-spatial information flow, we achieve a comprehensive analysis of hazy regions. The DehazeMamba layer (DML), constructed by WSSM, and residual DehazeMamba (RDM) blocks, composed of DMLs, are the core components of the HDMba framework. These components effectively characterize the complex distribution of haze in HSIs, aiding in scene reconstruction and dehazing. Experimental results on the Gaofen-5 HSI dataset demonstrate that HDMba outperforms other state-of-the-art methods in dehazing performance. The code will be available at this https URL.
- [115] arXiv:2406.05708 (cross-list from cs.RO) [pdf, html, other]
-
Title: Towards A General-Purpose Motion Planning for Autonomous Vehicles Using Fluid Dynamics
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
General-purpose motion planners for automated/autonomous vehicles promise to handle the task of motion planning (including tactical decision-making and trajectory generation) for various automated driving functions (ADF) in a diverse range of operational design domains (ODDs). The challenges of designing a general-purpose motion planner arise from several factors: a) a plethora of scenarios with different semantic information in each driving scene must be addressed, b) a strong coupling between long-term decision-making and short-term trajectory generation must be taken into account, c) the nonholonomic constraints of the vehicle dynamics must be considered, and d) the motion planner must be computationally efficient to run in real-time. The existing methods in the literature are either limited to specific scenarios (logic-based) or are data-driven (learning-based) and therefore lack explainability, which is important for safety-critical automated driving systems (ADS). This paper proposes a novel general-purpose motion planning solution for ADS inspired by the theory of fluid mechanics. A computationally efficient technique, i.e., the lattice Boltzmann method, is then adopted to generate a spatiotemporal vector field, which in accordance with the nonholonomic dynamic model of the Ego vehicle is employed to generate feasible candidate trajectories. The trajectory optimising ride quality, efficiency and safety is finally selected to calculate the imminent control signals, i.e., throttle/brake and steering angle. The performance of the proposed approach is evaluated by simulations in highway driving, on-ramp merging, and intersection crossing scenarios, and it is found to outperform traditional motion planning solutions based on model predictive control (MPC).
- [116] arXiv:2406.05726 (cross-list from cs.CV) [pdf, html, other]
-
Title: Region of Interest Loss for Anonymizing Learned Image Compression
Comments: Accepted to IEEE CASE 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
The use of AI in public spaces continually raises concerns about privacy and the protection of sensitive data. An example is the deployment of detection and recognition methods on humans, where images are provided by surveillance cameras. This results in the acquisition of great amounts of sensitive data, since the capture and transmission of images taken by such cameras happens unaltered, for them to be received by a server on the network. However, many applications do not explicitly require the identity of a given person in a scene; an anonymized representation containing information of the person's position while preserving the context of them in the scene suffices. We show how using a customized loss function on regions of interest (ROI) can achieve sufficient anonymization such that human faces become unrecognizable while persons are kept detectable, by training an end-to-end optimized autoencoder for learned image compression that utilizes the flexibility of the learned analysis and reconstruction transforms for the task of mutating parts of the compression result. This approach enables compression and anonymization in one step on the capture device, instead of transmitting sensitive, non-anonymized data over the network. Additionally, we evaluate how this anonymization impacts the average precision of pre-trained foundation models on detecting faces (MTCNN) and humans (YOLOv8) in comparison to non-ANN based methods, while considering compression rate and latency.
- [117] arXiv:2406.05747 (cross-list from cs.IT) [pdf, html, other]
-
Title: Rapid Optimization of Superposition Codes for Multi-Hop NOMA MANETs via Deep Unfolding
Comments: Under review for publication in the IEEE
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Various communication technologies are expected to utilize mobile ad hoc networks (MANETs). By combining MANETs with non-orthogonal multiple access (NOMA) communications, one can support scalable, spectrally efficient, and flexible network topologies. To achieve these benefits of NOMA MANETs, one should determine the transmission protocol, particularly the superposition code. However, the latter involves lengthy optimization that has to be repeated when the topology changes. In this work, we propose an algorithm for rapidly optimizing superposition codes in multi-hop NOMA MANETs. To achieve reliable tuning with few iterations, we adopt the emerging deep unfolding methodology, leveraging data to boost reliable settings. Our superposition coding optimization algorithm utilizes a small number of projected gradient steps while learning its per-user hyperparameters to maximize the minimal rate over past channels in an unsupervised manner. The learned optimizer is designed for both settings with full channel state information, as well as when the channel coefficients are to be estimated from pilots. We show that the combination of principled optimization and machine learning yields a scalable optimizer that, once trained, can be applied to different topologies. We cope with the non-convex nature of the optimization problem by applying parallel-learned optimization with different starting points as a form of ensemble learning. Our numerical results demonstrate that the proposed method enables the rapid setting of high-rate superposition codes for various channels.
- [118] arXiv:2406.05784 (cross-list from cs.SD) [pdf, html, other]
-
Title: Optimizing Multi-Stuttered Speech Classification: Leveraging Whisper's Encoder for Efficient Parameter Reduction in Automated Assessment
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
The automated classification of stuttered speech has significant implications for timely assessments providing assistance to speech language pathologists. Despite notable advancements in the field, the cases in which multiple disfluencies occur in speech require attention. We have taken a progressive approach to fill this gap by classifying multi-stuttered speech more efficiently. The problem has been addressed by, firstly, curating a dataset of multi-stuttered disfluencies from SEP-28k audio clips. Secondly, Whisper, a state-of-the-art speech recognition model, has been leveraged by using its encoder and treating the problem as multi-label classification. Thirdly, using a Whisper model with 6 encoder layers and experimenting with various layer freezing strategies, a computationally efficient configuration of the model was identified. The proposed configuration achieved micro, macro, and weighted F1-scores of 0.88, 0.85, and 0.87, respectively, on an external test dataset, i.e., Fluency-Bank. In addition, through layer freezing strategies, we were able to achieve the aforementioned results by fine-tuning a single encoder layer, consequently reducing the model's trainable parameters from 20.27 million to 3.29 million. This research study unveils the contribution of the last encoder layer in the identification of disfluencies in stuttered speech. Consequently, it has led to a computationally efficient approach which makes the model more adaptable for various dialects and languages.
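The three averages reported above differ only in how per-label counts are combined. The following minimal sketch computes micro, macro, and weighted F1 for a multi-label problem. The label matrices are invented for illustration and are not taken from SEP-28k or Fluency-Bank, and the helper names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (micro, macro, weighted) F1 for rows of 0/1 label indicators."""
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = fp = fn = 0
        for t, p in zip(y_true, y_pred):
            if t[j] and p[j]:
                tp += 1
            elif p[j]:
                fp += 1
            elif t[j]:
                fn += 1
        counts.append((tp, fp, fn))

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    # Macro: unweighted mean of the per-label F1 scores.
    per_label = [f1_from_counts(*c) for c in counts]
    macro = sum(per_label) / n_labels
    # Weighted: per-label F1 weighted by the number of true positives in the data.
    support = [c[0] + c[2] for c in counts]
    total = sum(support)
    weighted = sum(f * s for f, s in zip(per_label, support)) / total if total else 0.0
    return micro, macro, weighted


# Toy data: 5 clips, 3 hypothetical disfluency labels per clip.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0]]
micro, macro, weighted = multilabel_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3), round(weighted, 3))
```

Micro-averaging favors frequent labels, macro-averaging treats every label equally, and the weighted average sits between the two, which is one reason papers often report all three.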
- [119] arXiv:2406.05806 (cross-list from cs.CL) [pdf, html, other]
-
Title: Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper
Comments: In progress
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This research explores the interaction between Whisper, a high-performing speech recognition model, and information in prompts. Our results unexpectedly show that Whisper may not fully grasp textual prompts as anticipated. Additionally, we find that performance improvement is not guaranteed even with stronger adherence to the topic information in textual prompts. It is also noted that English prompts generally outperform Mandarin ones on datasets of both languages, likely due to differences in training data distributions for these languages. Conversely, we discover that Whisper exhibits awareness of misleading information in language tokens by effectively ignoring incorrect language tokens and focusing on the correct ones. In summary, this work raises questions about Whisper's prompt understanding capability and encourages further studies.
- [120] arXiv:2406.05828 (cross-list from cs.CV) [pdf, html, other]
-
Title: Multi-Stain Multi-Level Convolutional Network for Multi-Tissue Breast Cancer Image Segmentation
Akash Modi, Sumit Kumar Jha, Purnendu Mishra, Rajiv Kumar, Kiran Aatre, Gursewak Singh, Shubham Mathur
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Digital pathology and microscopy image analysis are widely employed in the segmentation of digitally scanned IHC slides, primarily to identify cancer and pinpoint regions of interest (ROI) indicative of tumor presence. However, current ROI segmentation models are either stain-specific or suffer from the issues of stain and scanner variance due to different staining protocols or modalities across multiple labs. Also, tissues like Ductal Carcinoma in Situ (DCIS), acini, etc. are often classified as tumors due to their structural similarities and color compositions. In this paper, we propose a novel convolutional neural network (CNN) based multi-class tissue segmentation model for histopathology whole-slide breast slides, which classifies tumors and segments other tissue regions such as ducts, acini, DCIS, squamous epithelium, blood vessels, necrosis, etc. as separate classes. Our unique pixel-aligned non-linear merge across spatial resolutions empowers models with both local and global fields of view for accurate detection of various classes. Our proposed model is also able to separate bad regions such as folds, artifacts, blurry regions, bubbles, etc. from tissue regions using multi-level context from different resolutions of WSI. Multi-phase iterative training with context-aware augmentation and increasing noise was used to efficiently train a multi-stain generic model with partial and noisy annotations from 513 slides. Our training pipeline used 12 million patches generated using context-aware augmentations, which made our model stain and scanner invariant across data sources. To test stain and scanner invariance, our model was evaluated on 23000 patches from a completely new stain (Hematoxylin and Eosin), a completely new scanner (Motic), and a different lab. The mean IOU was 0.72, which is on par with model performance on other data sources and scanners.
- [121] arXiv:2406.05863 (cross-list from cs.SD) [pdf, other]
-
Title: Source-Free Domain Adaptation for Speaker Verification in Data-Scarce Languages and Noisy Channels
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Domain adaptation is often hampered by exceedingly small target datasets and inaccessible source data. These conditions are prevalent in speech verification, where privacy policies and/or languages with scarce speech resources limit the availability of sufficient data. This paper explored techniques of source-free domain adaptation onto a limited target speech dataset for speaker verification in data-scarce languages. Both language and channel mismatch between source and target were investigated. Fine-tuning methods were evaluated and compared across different sizes of labeled target data. A novel iterative cluster-learn algorithm was studied for unlabeled target datasets.
- [122] arXiv:2406.05876 (cross-list from cs.CL) [pdf, html, other]
-
Title: Zero-Shot End-To-End Spoken Question Answering In Medical Domain
Comments: Accepted to INTERSPEECH 2024
Journal-ref: InterSpeech 2024
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In the rapidly evolving landscape of spoken question-answering (SQA), the integration of large language models (LLMs) has emerged as a transformative development. Conventional approaches often entail the use of separate models for question audio transcription and answer selection, resulting in significant resource utilization and error accumulation. To tackle these challenges, we explore the effectiveness of end-to-end (E2E) methodologies for SQA in the medical domain. Our study introduces a novel zero-shot SQA approach and compares it to traditional cascade systems. Through a comprehensive evaluation conducted on a new open benchmark of 8 medical tasks and 48 hours of synthetic audio, we demonstrate that our approach requires up to 14.7 times fewer resources than a combined 1.3B-parameter LLM with a 1.55B-parameter ASR model while improving average accuracy by 0.5%. These findings underscore the potential of E2E methodologies for SQA in resource-constrained contexts.
- [123] arXiv:2406.05900 (cross-list from cs.LG) [pdf, html, other]
-
Title: Large Language Models Memorize Sensor Datasets! Implications on Human Activity Recognition Research
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
The astonishing success of Large Language Models (LLMs) in Natural Language Processing (NLP) has spurred their use in many application domains beyond text analysis, including wearable sensor-based Human Activity Recognition (HAR). In such scenarios, often sensor data are directly fed into an LLM along with text instructions for the model to perform activity classification. Seemingly remarkable results have been reported for such LLM-based HAR systems when they are evaluated on standard benchmarks from the field. Yet, we argue, care has to be taken when evaluating LLM-based HAR systems in such a traditional way. Most contemporary LLMs are trained on virtually the entire (accessible) internet -- potentially including standard HAR datasets. With that, it is not unlikely that LLMs actually had access to the test data used in such benchmark experiments. The resulting contamination of training data would render these experimental evaluations meaningless. In this paper we investigate whether LLMs indeed have had access to standard HAR datasets during training. We apply memorization tests to LLMs, which involve instructing the models to extend given snippets of data. When comparing the LLM-generated output to the original data, we found a non-negligible number of matches, which suggests that the LLM under investigation seems to indeed have seen wearable sensor data from the benchmark datasets during training. For the Daphnet dataset in particular, GPT-4 is able to reproduce blocks of sensor readings. We report on our investigations and discuss potential implications on HAR research, especially with regard to reporting results on experimental evaluation.
- [124] arXiv:2406.05913 (cross-list from cs.NI) [pdf, html, other]
-
Title: Revisiting Multi-User Downlink in IEEE 802.11ax: A Designers Guide to MU-MIMO
Comments: This work has been submitted to the IEEE for possible publication. 7 pages, 6 figures, magazine paper
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Downlink (DL) Multi-User (MU) Multiple Input Multiple Output (MU-MIMO) is a key technology that allows multiple concurrent data transmissions from an Access Point (AP) to a selected sub-set of clients for higher network efficiency in IEEE 802.11ax. However, the DL MU-MIMO feature is typically turned off by default in AP vendors' products; that is, turning on DL MU-MIMO may not help increase network efficiency, which is counter-intuitive. In this article, we provide a sufficiently deep understanding of the interplay between the various underlying factors, i.e., CSI overhead and spatial correlation, which lead to negative results when DL MU-MIMO is turned on. Furthermore, we provide a fundamental guideline as a function of operational scenarios to address the fundamental question "when should DL MU-MIMO be turned on/off".
- [125] arXiv:2406.05915 (cross-list from cs.CV) [pdf, html, other]
-
Title: Bits-to-Photon: End-to-End Learned Scalable Point Cloud Compression for Direct Rendering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Point cloud is a promising 3D representation for volumetric streaming in emerging AR/VR applications. Despite recent advances in point cloud compression, decoding and rendering high-quality images from lossy compressed point clouds is still challenging in terms of quality and complexity, making it a major roadblock to achieve real-time 6-Degree-of-Freedom video streaming. In this paper, we address this problem by developing a point cloud compression scheme that generates a bit stream that can be directly decoded to renderable 3D Gaussians. The encoder and decoder are jointly optimized to consider both bit-rates and rendering quality. It significantly improves the rendering quality while substantially reducing decoding and rendering time, compared to existing point cloud compression methods. Furthermore, the proposed scheme generates a scalable bit stream, allowing multiple levels of details at different bit-rate ranges. Our method supports real-time color decoding and rendering of high quality point clouds, thus paving the way for interactive 3D streaming applications with free view points.
- [126] arXiv:2406.05916 (cross-list from quant-ph) [pdf, html, other]
-
Title: Reforming Quantum Microgrid Formation
Subjects: Quantum Physics (quant-ph); Systems and Control (eess.SY)
This letter introduces a novel compact and lossless quantum microgrid formation (qMGF) approach to achieve efficient operational optimization of the power system and improvement of resilience. This is achieved through lossless reformulation to ensure that the results are equivalent to those produced by the classical MGF by exploiting graph-theory-empowered quadratic unconstrained binary optimization (QUBO) that avoids the need for redundant encoding of continuous variables. Additionally, the qMGF approach utilizes a compact formulation that requires significantly fewer qubits compared to other quantum methods thereby enabling a high-accuracy and low-complexity deployment of qMGF on near-term quantum computers. Case studies on real quantum processing units (QPUs) empirically demonstrated that qMGF can achieve the same high accuracy as classic results with a significantly reduced number of qubits.
- [127] arXiv:2406.05923 (cross-list from cs.SD) [pdf, html, other]
-
Title: Contrastive Learning from Synthetic Audio Doppelgangers
Comments: 17 pages, 6 figures
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Learning robust audio representations currently demands extensive datasets of real-world sound recordings. By applying artificial transformations to these recordings, models can learn to recognize similarities despite subtle variations through techniques like contrastive learning. However, these transformations are only approximations of the true diversity found in real-world sounds, which are generated by complex interactions of physical processes, from vocal cord vibrations to the resonance of musical instruments. We propose a solution to both the data scale and transformation limitations, leveraging synthetic audio. By randomly perturbing the parameters of a sound synthesizer, we generate audio doppelgängers: synthetic positive pairs with causally manipulated variations in timbre, pitch, and temporal envelopes. These variations, difficult to achieve through transformations of existing audio, provide a rich source of contrastive information. Despite the shift to randomly generated synthetic data, our method produces strong representations, competitive with real data on standard audio classification benchmarks. Notably, our approach is lightweight, requires no data storage, and has only a single hyperparameter, which we extensively analyze. We offer this method as a complement to existing strategies for contrastive learning in audio, using synthesized sounds to reduce the data burden on practitioners.
- [128] arXiv:2406.05954 (cross-list from cs.AI) [pdf, html, other]
-
Title: Aligning Large Language Models with Representation Editing: A Control Perspective
Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Aligning large language models (LLMs) with human objectives is crucial for real-world applications. However, fine-tuning LLMs for alignment often suffers from unstable training and requires substantial computing resources. Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model, and their performance remains dependent on the original model's capabilities. To address these challenges, we propose aligning LLMs through representation editing. The core of our method is to view a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system. To achieve alignment for specific objectives, we introduce external control signals into the state space of this language dynamical system. We train a value function directly on the hidden states according to the Bellman equation, enabling gradient-based optimization to obtain the optimal control signals at test time. Our experiments demonstrate that our method outperforms existing test-time alignment techniques while requiring significantly fewer resources compared to fine-tuning methods.
- [129] arXiv:2406.06002 (cross-list from cs.LG) [pdf, html, other]
-
Title: Computational and Statistical Guarantees for Tensor-on-Tensor Regression with Tensor Train Decomposition
Comments: arXiv admin note: text overlap with arXiv:2401.02592
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
Recently, a tensor-on-tensor (ToT) regression model has been proposed to generalize tensor recovery, encompassing scenarios like scalar-on-tensor regression and tensor-on-vector regression. However, the exponential growth in tensor complexity poses challenges for storage and computation in ToT regression. To overcome this hurdle, tensor decompositions have been introduced, with the tensor train (TT)-based ToT model proving efficient in practice due to reduced memory requirements, enhanced computational efficiency, and decreased sampling complexity. Despite these practical benefits, a disparity exists between theoretical analysis and real-world performance. In this paper, we delve into the theoretical and algorithmic aspects of the TT-based ToT regression model. Assuming the regression operator satisfies the restricted isometry property (RIP), we conduct an error analysis for the solution to a constrained least-squares optimization problem. This analysis includes upper error bound and minimax lower bound, revealing that such error bounds polynomially depend on the order $N+M$. To efficiently find solutions meeting such error bounds, we propose two optimization algorithms: the iterative hard thresholding (IHT) algorithm (employing gradient descent with TT-singular value decomposition (TT-SVD)) and the factorization approach using the Riemannian gradient descent (RGD) algorithm. When RIP is satisfied, spectral initialization facilitates proper initialization, and we establish the linear convergence rate of both IHT and RGD.
- [130] arXiv:2406.06005 (cross-list from cs.RO) [pdf, html, other]
-
Title: WoCoCo: Learning Whole-Body Humanoid Control with Sequential Contacts
Comments: Website and Videos: this https URL
Subjects: Robotics (cs.RO); Graphics (cs.GR); Systems and Control (eess.SY)
Humanoid activities involving sequential contacts are crucial for complex robotic interactions and operations in the real world and are traditionally solved by model-based motion planning, which is time-consuming and often relies on simplified dynamics models. Although model-free reinforcement learning (RL) has become a powerful tool for versatile and robust whole-body humanoid control, it still requires tedious task-specific tuning and state machine design and suffers from long-horizon exploration issues in tasks involving contact sequences. In this work, we propose WoCoCo (Whole-Body Control with Sequential Contacts), a unified framework to learn whole-body humanoid control with sequential contacts by naturally decomposing the tasks into separate contact stages. Such decomposition facilitates simple and general policy learning pipelines through task-agnostic reward and sim-to-real designs, requiring only one or two task-related terms to be specified for each task. We demonstrated that end-to-end RL-based controllers trained with WoCoCo enable four challenging whole-body humanoid tasks involving diverse contact sequences in the real world without any motion priors: 1) versatile parkour jumping, 2) box loco-manipulation, 3) dynamic clap-and-tap dancing, and 4) cliffside climbing. We further show that WoCoCo is a general framework beyond humanoid by applying it in 22-DoF dinosaur robot loco-manipulation tasks.
- [131] arXiv:2406.06041 (cross-list from cs.GT) [pdf, html, other]
-
Title: Risk Sensitivity in Markov Games and Multi-Agent Reinforcement Learning: A Systematic Review
Comments: 14 pages, 2 figures, 1 table
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Markov games (MGs) and multi-agent reinforcement learning (MARL) are studied to model decision making in multi-agent systems. Traditionally, the objective in MG and MARL has been risk-neutral, i.e., agents are assumed to optimize a performance metric such as expected return, without taking into account subjective or cognitive preferences of themselves or of other agents. However, ignoring such preferences leads to inaccurate models of decision making in many real-world scenarios in finance, operations research, and behavioral economics. Therefore, when these preferences are present, it is necessary to incorporate a suitable measure of risk into the optimization objective of agents, which opens the door to risk-sensitive MG and MARL. In this paper, we systematically review the literature on risk sensitivity in MG and MARL that has been growing in recent years alongside other areas of reinforcement learning and game theory. We define and mathematically describe different risk measures used in MG and MARL and, for each measure individually, discuss articles that incorporate it. Finally, we identify recent trends in theoretical and applied works in the field and discuss possible directions of future research.
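To make the contrast between a risk-neutral and a risk-sensitive objective concrete, the sketch below evaluates the same set of sampled returns under the expected return and under two widely used risk measures, conditional value at risk (CVaR) and entropic risk. The abstract does not name the measures covered by the review, so this choice, the sample returns, and the parameter values are assumptions for illustration only.

```python
import math


def expected_return(returns):
    """Risk-neutral objective: the plain sample mean."""
    return sum(returns) / len(returns)


def cvar(returns, alpha):
    """Mean of the worst alpha-fraction of sampled returns (lower tail)."""
    k = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:k]
    return sum(worst) / k


def entropic_risk(returns, beta):
    """-(1/beta) * log E[exp(-beta * R)], risk-averse for beta > 0."""
    exponents = [-beta * r for r in returns]
    m = max(exponents)  # subtract the max before exponentiating for stability
    log_mean_exp = m + math.log(
        sum(math.exp(e - m) for e in exponents) / len(exponents)
    )
    return -log_mean_exp / beta


# Hypothetical episode returns: mostly good, with one rare large loss.
returns = [10.0, 10.0, 10.0, 10.0, -20.0]
print("expected return:", expected_return(returns))
print("CVaR (alpha=0.2):", cvar(returns, alpha=0.2))
print("entropic risk (beta=0.1):", entropic_risk(returns, beta=0.1))
```

Both risk measures score this return distribution below its mean because they put extra weight on the rare loss, which is the behavior a risk-averse agent would optimize for.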
- [132] arXiv:2406.06064 (cross-list from cs.IT) [pdf, html, other]
-
Title: 6DMA Enhanced Wireless Network with Flexible Antenna Position and Rotation: Opportunities and Challenges
Comments: 8 pages, 5 figures, 1 table
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
6DMA (six-dimensional movable antenna) is a new and revolutionary technology that fully exploits the wireless channel spatial variation at the transmitter/receiver by flexibly adjusting the three-dimensional (3D) positions and 3D rotations of distributed antennas/antenna surfaces (arrays). In this article, we provide an overview of 6DMA for unveiling its great potential in wireless networks, including its motivation and competitive advantages over existing technologies, system/channel modeling, and practical implementation. In particular, we present a variety of 6DMA-enabled performance enhancements in terms of array gain, spatial multiplexing, interference suppression, and geometric gain. Furthermore, we illustrate the main applications of 6DMA in wireless communication and sensing, and elaborate on their design challenges as well as promising solutions. Finally, numerical results are provided to demonstrate the significant capacity improvement of 6DMA-aided communication in wireless networks.
- [133] arXiv:2406.06086 (cross-list from cs.SD) [pdf, html, other]
-
Title: RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection
Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan
Comments: Accepted by Interspeech 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use a sinc layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1% improvement over Rawformer on the ASVspoof2021 LA dataset, and demonstrates competitive performance on other datasets.
- [134] arXiv:2406.06097 (cross-list from cs.SD) [pdf, other]
-
Title: StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection
Comments: Accepted at ACL 2024 main conference
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing works solely focusing on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy, providing a first step in StreamST research.
- [135] arXiv:2406.06139 (cross-list from cs.SD) [pdf, html, other]
-
Title: Thunder : Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge
Comments: 5 pages, 3 figures, 4 tables, This paper will be submitted in the interspeech conference
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Diffusion-based speech enhancement has shown promising results, but can suffer from a slower inference time. Initializing the diffusion process with the enhanced audio generated by a regression-based model can be used to reduce the computational steps required. However, these approaches often necessitate a regression model, further increasing the system's complexity. We propose Thunder, a unified regression-diffusion model that utilizes the Brownian bridge process, which allows the model to act in both modes. The regression mode can be accessed by setting the diffusion time step close to 1. However, the standard score-based diffusion modeling does not perform well in this setup due to gradient instability. To mitigate this problem, we modify the diffusion model to predict the clean speech instead of the score function, achieving competitive performance with a more compact model size and fewer reverse steps.
- [136] arXiv:2406.06208 (cross-list from cs.SD) [pdf, html, other]
-
Title: Quantifying the effect of speech pathology on automatic and human speaker verification
Bence Mark Halpern, Thomas Tienkamp, Wen-Chin Huang, Lester Phillip Violeta, Teja Rebernik, Sebastiaan de Visscher, Max Witjes, Martijn Wieling, Defne Abur, Tomoki Toda
Comments: 5 pages, 2 figures, 2 tables. Accepted to Interspeech 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
This study investigates how surgical intervention for speech pathology (specifically, as a result of oral cancer surgery) impacts the performance of an automatic speaker verification (ASV) system. Using two recently collected Dutch datasets with parallel pre- and post-surgery audio from the same speaker, NKI-OC-VC and SPOKE, we assess the extent to which speech pathology influences ASV performance, and whether objective/subjective measures of speech severity are correlated with the performance. Finally, we carry out a perceptual study to compare judgements of ASV and human listeners. Our findings reveal that pathological speech negatively affects ASV performance, and the severity of the speech is negatively correlated with the performance. There is a moderate agreement in perceptual and objective scores of speaker similarity and severity; however, we could not clearly establish in the perceptual study whether the same phenomenon also exists in human perception.
- [137] arXiv:2406.06280 (cross-list from cs.IT) [pdf, html, other]
-
Title: Optimal sensing policy with interference-model uncertainty
Comments: Submitted to IEEE communications letters
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Assume that an interferer behaves according to a parametric model but one does not know the value of the model parameters. Sensing makes it possible to improve the model knowledge and therefore perform a better link adaptation. However, we consider a half-duplex scenario where, at each time slot, the communication system should decide between sensing and communication. We thus propose to investigate the optimal policy to maximize the expected sum rate given a finite-time communication. We first show that this problem can be modelled in the Markov decision process (MDP) framework. We then demonstrate that the optimal open-loop and closed-loop policies can be found significantly faster than with the standard backward-induction algorithm.
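For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of standard backward induction for a finite-horizon MDP. It is not the paper's faster method. The two-state toy model (`"unknown"` / `"known"` interference model), the reward values, and all function names are assumptions made purely for illustration.

```python
def backward_induction(states, actions, transition, reward, horizon):
    """Finite-horizon dynamic programming (the standard baseline).

    transition(s, a) -> list of (probability, next_state)
    reward(s, a)     -> expected immediate reward
    Returns (value, policy); policy[t][s] is the best action at slot t.
    """
    value = {s: 0.0 for s in states}  # nothing left to earn after the last slot
    policy = []
    for _ in range(horizon):
        new_value, decision = {}, {}
        for s in states:
            best_action, best_q = None, float("-inf")
            for a in actions:
                q = reward(s, a) + sum(p * value[nxt] for p, nxt in transition(s, a))
                if q > best_q:
                    best_action, best_q = a, q
            new_value[s], decision[s] = best_q, best_action
        value = new_value
        policy.insert(0, decision)  # earlier slots are computed last
    return value, policy


# Hypothetical toy model: sensing earns nothing now but reveals the interferer model,
# after which communicating earns a higher rate (2.0 instead of 1.0).
STATES = ["unknown", "known"]
ACTIONS = ["sense", "communicate"]

def toy_transition(state, action):
    if state == "unknown" and action == "sense":
        return [(1.0, "known")]
    return [(1.0, state)]

def toy_reward(state, action):
    if action == "sense":
        return 0.0
    return 2.0 if state == "known" else 1.0

value, policy = backward_induction(STATES, ACTIONS, toy_transition, toy_reward, horizon=3)
```

In this toy instance the computed policy senses in the first slot and communicates in the last one, which illustrates the sense-or-communicate trade-off the abstract describes.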
- [138] arXiv:2406.06284 (cross-list from cs.IT) [pdf, html, other]
-
Title: An ODMA-Based Unsourced Random Access Scheme with a Multiple Antenna Receiver
Subjects: Information Theory (cs.IT); Systems and Control (eess.SY)
We investigate the unsourced random access scheme assuming that the base station is equipped with multiple antennas, and propose a high-performing solution utilizing on-off-division multiple access. We assume that each user spreads its pilot sequence and polar codeword to the pilot and data parts of the transmission frame, respectively, based on a transmission pattern. The iterative receiver operation consists of pilot and pattern detection followed by channel vector and symbol estimation, polar decoding, and successive interference cancellation. Numerical findings demonstrate that the proposed scheme has superior performance compared to the state-of-the-art in various antenna settings.
- [139] arXiv:2406.06295 (cross-list from cs.SD) [pdf, html, other]
-
Title: Zero-Shot Audio Captioning Using Soft and Hard Prompts
Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, which, however, has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space. In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP. We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, enabling it to generate captions more robustly and accurately. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method.
- [140] arXiv:2406.06310 (cross-list from cs.SD) [pdf, html, other]
-
Title: Unsupervised Improved MVDR Beamforming for Sound Enhancement
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Neural networks have recently become the dominant approach to sound separation. Their good performance relies on large datasets of isolated recordings. For speech and music, isolated single channel data are readily available; however, the same does not hold in the multi-channel case, and with most other sound classes. Multi-channel methods have the potential to outperform single channel approaches as they can exploit both spatial and spectral features, but the lack of training data remains a challenge. We propose unsupervised improved minimum variance distortionless response (UIMVDR), which enables multi-channel separation to leverage in-the-wild single-channel data through unsupervised training and beamforming. Results show that UIMVDR generalizes well and improves separation performance compared to supervised models, particularly in cases with limited supervised data. By using data available online, it also reduces the effort required to gather data for multi-channel approaches.
- [141] arXiv:2406.06329 (cross-list from cs.CL) [pdf, html, other]
-
Title: A Parameter-efficient Language Extension Framework for Multilingual ASR
Comments: Accepted by Interspeech 2024
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Covering all languages with a multilingual speech recognition model (MASR) is very difficult. Performing language extension on top of an existing MASR is a desirable choice. In this study, the MASR continual learning problem is probabilistically decomposed into language identity prediction (LP) and cross-lingual adaptation (XLA) sub-problems. Based on this, we propose an architecture-based framework for language extension that can fundamentally solve catastrophic forgetting, dubbed PELE. PELE is designed to be parameter-efficient, incrementally incorporating an add-on module to adapt to a new language. Specifically, different parameter-efficient fine-tuning (PEFT) modules and their variants are explored as potential candidates to perform XLA. Experiments are carried out on 5 new languages with a wide range of low-resourced data sizes. The best-performing PEFT candidate can achieve satisfactory performance across all languages and demonstrates superiority in three of five languages over the continual joint learning setting. Notably, PEFT methods focusing on weight parameters or input features are revealed to be limited in performance, showing significantly inferior extension capabilities compared to inserting a lightweight module in between layers such as an Adapter.
- [142] arXiv:2406.06332 (cross-list from cs.SD) [pdf, html, other]
-
Title: An automatic analysis of ultrasound vocalisations for the prediction of interaction context in captive Egyptian fruit bats
Comments: Accepted at EUSIPCO 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Prior work in computational bioacoustics has mostly focused on the detection of animal presence in a particular habitat. However, animal sounds contain much richer information than mere presence; among others, they encapsulate the interactions of those animals with other members of their species. Studying these interactions is almost impossible in a naturalistic setting, as the ground truth is often lacking. The use of animals in captivity instead offers a viable alternative pathway. However, most prior works follow a traditional, statistics-based approach to analysing interactions. In the present work, we go beyond this standard framework by attempting to predict the underlying context in interactions between captive Rousettus aegyptiacus using deep neural networks. We reach an unweighted average recall of over 30% -- more than thrice the chance level -- and show error patterns that differ from our statistical analysis. This work thus represents an important step towards the automatic analysis of states in animals from sound.
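As a brief illustration of the metric quoted in this abstract, unweighted average recall (UAR) is the mean of the per-class recalls, so a constant predictor scores 1/K on K classes. The sketch below uses hypothetical labels and a hypothetical function name; it does not reproduce the paper's data or evaluation code.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally regardless of its size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == guess:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)

# Hypothetical labels for three interaction contexts (not taken from the paper).
y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c", "c"]
uar = unweighted_average_recall(y_true, y_pred)  # recalls are 3/4, 1/2 and 2/2

# A constant predictor gets recall 1 on one class and 0 on the rest: UAR = 1/3 here.
chance = unweighted_average_recall(y_true, ["a"] * len(y_true))
```

Because each class contributes equally, UAR is not inflated by a dominant class the way plain accuracy can be.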
- [143] arXiv:2406.06337 (cross-list from physics.optics) [pdf, html, other]
-
Title: System- and Sample-agnostic Isotropic 3D Microscopy by Weakly Physics-informed, Domain-shift-resistant Axial Deblurring
Comments: 27 pages, 6 figures
Subjects: Optics (physics.optics); Image and Video Processing (eess.IV); Biological Physics (physics.bio-ph)
Three-dimensional (3D) subcellular imaging is essential for biomedical research, but the diffraction limit of optical microscopy compromises axial resolution, hindering accurate 3D structural analysis. This challenge is particularly pronounced in label-free imaging of thick, heterogeneous tissues, where assumptions about data distribution (e.g. sparsity, label-specific distribution, and lateral-axial similarity) and system priors (e.g. independent and identically distributed (i.i.d.) noise and linear shift-invariant (LSI) point-spread functions (PSFs)) are often invalid. Here, we introduce SSAI-3D, a weakly physics-informed, domain-shift-resistant framework for robust isotropic 3D imaging. SSAI-3D enables robust axial deblurring by generating a PSF-flexible, noise-resilient, sample-informed training dataset and sparsely fine-tuning a large pre-trained blind deblurring network. SSAI-3D was applied to label-free nonlinear imaging of living organoids, freshly excised human endometrium tissue, and mouse whisker pads, and further validated in publicly available ground-truth-paired experimental datasets of 3D heterogeneous biological tissues with unknown blurring and noise across different microscopy systems.
- [144] arXiv:2406.06339 (cross-list from cs.SD) [pdf, html, other]
-
Title: Audio-based Step-count Estimation for Running -- Windowing and Neural Network Baselines
Comments: Accepted at EUSIPCO 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In recent decades, running has become an increasingly popular pastime activity due to its accessibility, ease of practice, and anticipated health benefits. However, the risk of running-related injuries is substantial for runners of different experience levels. Several common forms of injuries result from overuse -- extending beyond the recommended running time and intensity. Recently, audio-based tracking has emerged as yet another modality for monitoring running behaviour and performance, with previous studies largely concentrating on predicting runner fatigue. In this work, we investigate audio-based step count estimation during outdoor running, achieving a mean absolute error of 1.098 in window-based step-count differences and a Pearson correlation coefficient of 0.479 when predicting the number of steps in a 5-second window of audio. Our work thus showcases the feasibility of audio-based monitoring for estimating important physiological variables and lays the foundations for further utilising audio sensors for a more thorough characterisation of runner behaviour.
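The two figures reported in this abstract, mean absolute error and the Pearson correlation coefficient, can be computed as in the sketch below. The per-window step counts are made-up numbers chosen only to show the arithmetic, and the function names are assumptions.

```python
import math

def mean_absolute_error(truth, pred):
    """Average absolute difference between reference and predicted counts."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def pearson_r(xs, ys):
    """Pearson correlation: covariance normalised by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical step counts for four 5-second windows (not data from the paper).
true_steps = [14, 15, 13, 16]
pred_steps = [13, 15, 14, 18]
mae = mean_absolute_error(true_steps, pred_steps)
r = pearson_r(true_steps, pred_steps)
```

The two metrics answer different questions: MAE measures how far the counts are off on average, while Pearson's r measures whether predictions rise and fall with the reference counts.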
- [145] arXiv:2406.06341 (cross-list from cs.SD) [pdf, html, other]
-
Title: Predicting Heart Activity from Speech using Data-driven and Knowledge-based features
Comments: Accepted at Interspeech 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Accurately predicting heart activity and other biological signals is crucial for diagnosis and monitoring. Given that speech is an outcome of multiple physiological systems, a significant body of work has studied the acoustic correlates of heart activity. Recently, self-supervised models have excelled in speech-related tasks compared to traditional acoustic methods. However, the robustness of data-driven representations in predicting heart activity has remained unexplored. In this study, we demonstrate that self-supervised speech models outperform acoustic features in predicting heart activity parameters. We also emphasize the impact of individual variability on model generalizability. These findings underscore the value of data-driven representations in such tasks and the need for more speech-based physiological data to mitigate speaker-related challenges.
- [146] arXiv:2406.06353 (cross-list from cs.RO) [pdf, other]
-
Title: A quantitative investigation for deployment of mobile collaborative robots in high-value manufacturing
Amine Hifi, W. Jackson, C. Loukas, M. Shields, A. Poole, E. Mohseni, C.N. MacLeod, G. Dobie, S.G. Pierce, T. O'Hare, G. Munro, J. O'Brian-O'Reilly, R.W.K. Vithanage
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Component inspection is often the bottleneck in high-value manufacturing, driving industries like aerospace toward automated inspection technologies. Current systems often employ fixed arm robots, but they lack the flexibility to adapt to new components or orientations. Advanced mobile robotic platforms with updated sensor technologies and algorithms have improved localization and path planning capabilities, making them ideal for bringing inspection processes directly to parts. However, mobile platforms introduce challenges in localization and maneuverability, leading to potential errors. Their positional uncertainty is higher than fixed systems due to the lack of a fixed calibrated location, posing challenges for position-sensitive inspection sensors. Therefore, it's essential to assess the positional accuracy and repeatability of mobile manipulator platforms. The KUKA KMR iiwa was chosen for its collaborative features, robust build, and scalability within the KUKA product range. The accuracy and repeatability of the mobile platform were evaluated through a series of tests to evaluate the performance of its integrated feature mapping, the effect of various speeds on positional accuracy, and the efficiency of the omnidirectional wheels for a range of translation orientations. Experimental evaluation revealed that enabling feature mapping substantially improves the KUKA KMR iiwa's performance, with accuracy gains and error reductions exceeding 90%. Repeatability errors were under 7 mm with mapping activated and around 2.5 mm in practical scenarios, demonstrating that mobile manipulators, incorporating both the manipulator and platform, can fulfil the precise requirements of industries with high precision needs, providing a highly diverse alternative to traditional fixed-base industrial manipulators.
- [147] arXiv:2406.06371 (cross-list from cs.CL) [pdf, html, other]
-
Title: mHuBERT-147: A Compact Multilingual HuBERT Model
Comments: Extended version of the Interspeech 2024 paper of same name
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment over the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations and with only 95M parameters, mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min/1h leaderboards respectively, with SOTA scores for all LID tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings suggest that mHuBERT-147 is a promising model for multilingual speech processing tasks, offering an unprecedented balance between high performance and parameter efficiency.
- [148] arXiv:2406.06375 (cross-list from cs.SD) [pdf, html, other]
-
Title: MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing
Yu-Fen Huang, Nikki Moran, Simon Coleman, Jon Kelly, Shun-Hwa Wei, Po-Yin Chen, Yun-Hsin Huang, Tsung-Ping Chen, Yu-Chia Kuo, Yu-Chi Wei, Chih-Hsuan Li, Da-Yu Huang, Hsuan-Kai Kao, Ting-Wei Lin, Li Su
Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024. 14 pages, 7 figures. Dataset is available on: this https URL and this https URL
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrase, and expressive contents from audio, video and motion data, and the generation of musicians' body motion from given music audio. The dataset and codes are available alongside this publication (this https URL).
- [149] arXiv:2406.06403 (cross-list from cs.CL) [pdf, html, other]
-
Title: Meta Learning Text-to-Speech Synthesis in over 7000 Languages
Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu
Comments: accepted at Interspeech 2024
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.
- [150] arXiv:2406.06406 (cross-list from cs.CL) [pdf, html, other]
-
Title: Controlling Emotion in Text-to-Speech with Natural Language Prompts
Comments: accepted at Interspeech 2024
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.
- [151] arXiv:2406.06438 (cross-list from cs.CL) [pdf, html, other]
-
Title: Multimodal Contextualized Semantic Parsing from Speech
Comments: 10 Pages, 3 figures, ACL 2024 Main
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
- [152] arXiv:2406.06514 (cross-list from cs.LG) [pdf, html, other]
-
Title: Random Features Approximation for Control-Affine Systems
Comments: 25 pages, 3 figures
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Modern data-driven control applications call for flexible nonlinear models that are amenable to principled controller synthesis and realtime feedback. Many nonlinear dynamical systems of interest are control affine. We propose two novel classes of nonlinear feature representations which capture control affine structure while allowing for arbitrary complexity in the state dependence. Our methods make use of random features (RF) approximations, inheriting the expressiveness of kernel methods at a lower computational cost. We formalize the representational capabilities of our methods by showing their relationship to the Affine Dot Product (ADP) kernel proposed by Castañeda et al. (2021) and a novel Affine Dense (AD) kernel that we introduce. We further illustrate the utility by presenting a case study of data-driven optimization-based control using control certificate functions (CCF). Simulation experiments on a double pendulum empirically demonstrate the advantages of our methods.
Cross submissions for Tuesday, 11 June 2024 (showing 62 of 62 entries)
- [153] arXiv:2201.11384 (replaced) [pdf, html, other]
-
Title: Phase Retrieval for Radar Waveform Design
Comments: 40 pages, 13 figures, 1 table
Subjects: Signal Processing (eess.SP); Information Retrieval (cs.IR)
The ability of a radar to discriminate in both range and Doppler velocity is completely characterized by the ambiguity function (AF) of its transmit waveform. Mathematically, it is obtained by correlating the waveform with its Doppler-shifted and delayed replicas. We consider the inverse problem of designing a radar transmit waveform that satisfies the specified AF magnitude. This process may be viewed as a signal reconstruction with some variation of phase retrieval methods. We provide a trust-region algorithm that minimizes a smoothed non-convex least-squares objective function to iteratively recover the underlying signal-of-interest for either time- or band-limited support. The method first approximates the signal using an iterative spectral algorithm and then refines the attained initialization based on a sequence of gradient iterations. Our theoretical analysis shows that unique signal reconstruction is possible using signal samples no more than thrice the number of signal frequencies or time samples. Numerical experiments demonstrate that our method recovers both time- and band-limited signals from sparsely and randomly sampled, noisy, and noiseless AFs.
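To make the correlation described in this abstract concrete, the sketch below evaluates a discrete ambiguity function for a short sequence. The cyclic-delay convention, the sign of the Doppler exponent, and the function name `ambiguity` are assumptions chosen for illustration; the paper's own definition may differ.

```python
import cmath

def ambiguity(x, delay, doppler):
    """Discrete periodic ambiguity function of sequence x (assumed convention).

    A(delay, doppler) = sum_n x[n] * conj(x[(n - delay) mod N]) * exp(-2j*pi*doppler*n/N)
    """
    n_samples = len(x)
    total = 0j
    for n in range(n_samples):
        delayed = x[(n - delay) % n_samples]                       # cyclically delayed replica
        phase = cmath.exp(-2j * cmath.pi * doppler * n / n_samples)  # Doppler shift
        total += x[n] * delayed.conjugate() * phase
    return total

# Example: a length-4 unit-modulus waveform, x[n] = exp(j*pi*n/2).
waveform = [1 + 0j, 1j, -1 + 0j, -1j]
peak = abs(ambiguity(waveform, 0, 0))      # equals the waveform energy
off_peak = abs(ambiguity(waveform, 0, 1))  # zero delay, one Doppler bin away
```

The design problem in the abstract runs in the opposite direction: given only the magnitudes of such values on a delay-Doppler grid, recover a waveform that produces them.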
- [154] arXiv:2204.11970 (replaced) [pdf, other]
-
Title: Visual Acuity Prediction on Real-Life Patient Data Using a Machine Learning Based Multistage System
Tobias Schlosser, Frederik Beuth, Trixy Meyer, Arunodhayan Sampath Kumar, Gabriel Stolze, Olga Furashova, Katrin Engelmann, Danny Kowerko
Comments: Accepted for: Scientific Reports
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
In ophthalmology, intravitreal operative medication therapy (IVOM) is a widespread treatment for diseases related to age-related macular degeneration (AMD), diabetic macular edema (DME), and retinal vein occlusion (RVO). However, in real-world settings, patients often suffer from loss of vision on time scales of years despite therapy, whereas the prediction of the visual acuity (VA) and the earliest possible detection of deterioration under real-life conditions is challenging due to heterogeneous and incomplete data. In this contribution, we present a workflow for the development of a research-compatible data corpus fusing different IT systems of the department of ophthalmology of a German maximum care hospital. The extensive data corpus allows predictive statements of the expected progression of a patient and his or her VA in each of the three diseases. For the disease AMD, we found a significant deterioration of the visual acuity over time. Within our proposed multistage system, we subsequently classify the VA progression into the three groups of therapy "winners", "stabilizers", and "losers" (WSL classification scheme). Our OCT biomarker classification using an ensemble of deep neural networks results in a classification accuracy (F1-score) of over 98 %, enabling us to complete incomplete OCT documentations while allowing us to exploit them for a more precise VA modeling process. Our VA prediction requires at least four VA examinations and optionally OCT biomarkers from the same time period to predict the VA progression within a forecasted time frame, whereas our prediction is currently restricted to IVOM / no therapy. We achieve a final prediction accuracy of 69 % in macro average F1-score, while being in the same range as the ophthalmologists with 57.8 and 50 ± 10.7 % F1-score.
- [155] arXiv:2209.07689 (replaced) [pdf, html, other]
-
Title: A Unified Multi-Task Semantic Communication System for Multimodal Data
Subjects: Signal Processing (eess.SP)
Task-oriented semantic communications have achieved significant performance gains. However, the employed deep neural networks in semantic communications have to be updated when the task is changed or multiple models need to be stored for performing different tasks. To address this issue, we develop a unified deep learning-enabled semantic communication system (U-DeepSC), where a unified end-to-end framework can serve many different tasks with multiple modalities of data. As the number of required features varies from task to task, we propose a vector-wise dynamic scheme that can adjust the number of transmitted symbols for different tasks. Moreover, our dynamic scheme can also adaptively adjust the number of transmitted features under different channel conditions to optimize the transmission efficiency. Particularly, we devise a lightweight feature selection module (FSM) to evaluate the importance of feature vectors, which can hierarchically drop redundant feature vectors and significantly accelerate the inference. To reduce the transmission overhead, we then design a unified codebook for feature representation to serve multiple tasks, where only the indices of these task-specific features in the codebook are transmitted. According to the simulation results, the proposed U-DeepSC achieves comparable performance to the task-oriented semantic communication system designed for a specific task but with significant reduction in both transmission overhead and model size.
- [156] arXiv:2301.12537 (replaced) [pdf, html, other]
-
Title: Non-Asymptotic State-Space Identification of Closed-Loop Stochastic Linear Systems using Instrumental Variables
Comments: 12 pages, 4 tables, 3 figures
Journal-ref: Systems & Control Letters, Elsevier, Volume 178, 2023, 105565
Subjects: Systems and Control (eess.SY); Dynamical Systems (math.DS); Methodology (stat.ME)
The paper suggests a generalization of the Sign-Perturbed Sums (SPS) finite sample system identification method for the identification of closed-loop observable stochastic linear systems in state-space form. The solution builds on the theory of matrix-variate regression and instrumental variable methods to construct distribution-free confidence regions for the state-space matrices. Both direct and indirect identification are studied, and the exactness as well as the strong consistency of the construction are proved. Furthermore, a new, computationally efficient ellipsoidal outer-approximation algorithm for the confidence regions is proposed. The new construction results in a semidefinite optimization problem which has an order-of-magnitude smaller number of constraints than if one applied the ellipsoidal outer-approximation after vectorization. The effectiveness of the approach is also demonstrated empirically via a series of numerical experiments.
- [157] arXiv:2303.06617 (replaced) [pdf, html, other]
-
Title: IFF: A Super-resolution Algorithm for Multiple Measurements
Subjects: Signal Processing (eess.SP)
We consider the problem of reconstructing one-dimensional point sources from their Fourier measurements in a bounded interval $[-\Omega, \Omega]$. This problem is known to be challenging in the regime where the spacing of the sources is below the Rayleigh length $\frac{\pi}{\Omega}$. In this paper, we propose a super-resolution algorithm, called Iterative Focusing-localization and Filtering (IFF), to resolve closely spaced point sources from their multiple measurements that are obtained by using multiple unknown illumination patterns. The proposed algorithm has a distinct feature in that it reconstructs the point sources one by one in an iterative manner and hence requires no prior information about the number of sources. The new feature also allows for a subsampling strategy that can circumvent the computation of singular-value decomposition for large matrices as in the usual subspace methods. A theoretical analysis of the methods behind the algorithm is also provided. The derived results imply a phase transition phenomenon in the reconstruction of source locations which is confirmed in numerical experiments. Numerical results show that the algorithm can achieve a stable reconstruction for point sources with a minimum separation distance that is close to the theoretical limit. The algorithm can be generalized to higher dimensions.
- [158] arXiv:2304.03699 (replaced) [pdf, html, other]
-
Title: Sorta Solving the OPF by Not Solving the OPF: DAE Control Theory and the Price of Realtime Regulation
Subjects: Systems and Control (eess.SY)
This paper presents a new approach to approximate the AC optimal power flow (ACOPF). By eliminating the need to solve the ACOPF every few minutes, the paper showcases how a realtime feedback controller can be utilized in lieu of ACOPF and its variants. By (i) forming the grid dynamics as a system of differential-algebraic equations (DAE) that naturally encode the non-convex OPF power flow constraints, (ii) utilizing DAE-Lyapunov theory, and (iii) designing a feedback controller that captures realtime uncertainty while being uncertainty-unaware, the presented approach shows promise in obtaining solutions that are close to the OPF ones without needing to solve the OPF. The proposed controller responds in realtime to deviations in renewables generation and loads, guaranteeing improvements in system transient stability, while always yielding approximate solutions of the ACOPF with no constraint violations. As the studied approach herein yields slightly more expensive realtime generator controls, the corresponding price of realtime control and regulation is examined. Cost comparisons with the traditional ACOPF are also showcased -- all via case studies on standard power networks.
- [159] arXiv:2304.07766 (replaced) [pdf, html, other]
-
Title: JUMP: Joint communication and sensing with Unsynchronized transceivers Made Practical
Jacopo Pegoraro, Jesus O. Lacruz, Tommy Azzino, Marco Mezzavilla, Michele Rossi, Joerg Widmer, Sundeep Rangan
Comments: 17 pages, 18 figures
Journal-ref: IEEE Transactions on Wireless Communications, 2024
Subjects: Signal Processing (eess.SP)
Wideband millimeter-wave communication systems can be extended to provide radar-like sensing capabilities on top of data communication, in a cost-effective manner. However, the development of joint communication and sensing technology is hindered by practical challenges, such as occlusions to the line-of-sight path and clock asynchrony between devices. The latter introduces time-varying timing and frequency offsets that prevent the estimation of sensing parameters and, in turn, the use of standard signal processing solutions. Existing approaches cannot be applied to commonly used phased-array receivers, as they build on stringent assumptions about the multipath environment, and are computationally complex. We present JUMP, the first system enabling practical bistatic and asynchronous joint communication and sensing, while achieving accurate target tracking and micro-Doppler extraction in realistic conditions. Our system compensates for the timing offset by exploiting the channel correlation across subsequent packets. Further, it tracks multipath reflections and eliminates frequency offsets by observing the phase of a dynamically-selected static reference path. JUMP has been implemented on a 60 GHz experimental platform, performing extensive evaluations of human motion sensing, including non-line-of-sight scenarios. In our results, JUMP attains comparable tracking performance to a full-duplex monostatic system and similar micro-Doppler quality with respect to a phase-locked bistatic receiver.
- [160] arXiv:2305.14036 (replaced) [pdf, html, other]
-
Title: Robust Fault Estimators for Nonlinear Systems: An Ultra-Local Model Design
Comments: arXiv admin note: text overlap with arXiv:2204.01455
Subjects: Systems and Control (eess.SY)
This paper proposes a nonlinear estimator for the robust reconstruction of process and sensor faults for a class of uncertain nonlinear systems. The proposed fault estimation method augments the system dynamics with an ultra-local (in time) internal state-space representation (a finite chain of integrators) of the fault vector. Next, a nonlinear state observer is designed based on the known parts of the augmented dynamics. This nonlinear filter (observer) reconstructs the fault signal as well as the states of the augmented system. We provide sufficient conditions that guarantee stability of the estimation error dynamics: firstly, asymptotic stability (i.e., exact fault estimation) in the absence of perturbations induced by the fault model mismatch (mismatch between internal ultra-local model for the fault and the actual fault dynamics), uncertainty, external disturbances, and measurement noise and, secondly, Input-to-State Stability (ISS) of the estimation error dynamics is guaranteed in the presence of these perturbations. In addition, to support performance-based estimator design, we provide Linear Matrix Inequality (LMI) conditions for $\mathcal{L}_2$-gain and $\mathcal{L}_2-\mathcal{L}_\infty$ induced norm and cast the synthesis of the estimator gains as a semi-definite program where the effect of model mismatch and external disturbances on the fault estimation error is minimized in the sense of $\mathcal{L}_2$-gain, for an acceptable $\mathcal{L}_2-\mathcal{L}_\infty$ induced norm with respect to measurement noise. The latter result facilitates a design that explicitly addresses the performance trade-off between noise sensitivity and robustness against model mismatch and external disturbances. Finally, numerical results for a benchmark system illustrate the performance of the proposed methodologies.
- [161] arXiv:2308.11639 (replaced) [pdf, html, other]
-
Title: An Empirical Study on Fault Detection and Root Cause Analysis of Indium Tin Oxide Electrodes by Processing S-parameter Patterns
Comments: Accepted in IEEE Transactions on Device and Materials Reliability
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In the field of optoelectronics, indium tin oxide (ITO) electrodes play a crucial role in various applications, such as displays, sensors, and solar cells. Effective fault diagnosis and root cause analysis of the ITO electrodes are essential to ensure the performance and reliability of the devices. However, traditional visual inspection is challenging with transparent ITO electrodes, and existing fault diagnosis methods have limitations in determining the root causes of the defects, often requiring destructive evaluations and secondary material characterization techniques. In this study, a fault diagnosis method with root cause analysis is proposed using scattering parameter (S-parameter) patterns, offering early detection, high diagnostic accuracy, and noise robustness. A comprehensive S-parameter pattern database is obtained according to various defect states of the ITO electrodes. Deep learning (DL) approaches, including multilayer perceptron (MLP), convolutional neural network (CNN), and transformer, are then used to simultaneously analyze the cause and severity of defects. Notably, it is demonstrated that the diagnostic performance under additive noise levels can be significantly enhanced by combining different channels of the S-parameters as input to the learning algorithms, as confirmed through the t-distributed stochastic neighbor embedding (t-SNE) dimension reduction visualization of the S-parameter patterns.
- [162] arXiv:2309.07648 (replaced) [pdf, html, other]
-
Title: Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer
Comments: Accepted in INTERSPEECH 2024
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Despite advancements of end-to-end (E2E) models in speech recognition, named entity recognition (NER) is still challenging but critical for semantic understanding. Previous studies mainly focus on various rule-based or attention-based contextual biasing algorithms. However, their performance might be sensitive to the biasing weight or degraded by excessive attention to the named entity list, along with a risk of false triggering. Inspired by the success of the class-based language model (LM) in NER in conventional hybrid systems and the effective decoupling of acoustic and linguistic information in the factorized neural Transducer (FNT), we propose C-FNT, a novel E2E model that incorporates class-based LMs into FNT. In C-FNT, the LM score of named entities can be associated with the name class instead of its surface form. The experimental results show that our proposed C-FNT significantly reduces error in named entities without hurting performance in general word recognition.
- [163] arXiv:2310.15426 (replaced) [pdf, html, other]
-
Title: zonoLAB: A MATLAB toolbox for set-based control systems analysis using hybrid zonotopes
Subjects: Systems and Control (eess.SY)
This paper introduces zonoLAB, a MATLAB-based toolbox for set-based control system analysis using the hybrid zonotope set representation. Hybrid zonotopes have proven to be an expressive set representation that can exactly represent the reachable sets of mixed-logical dynamical systems and tightly approximate the reachable sets of nonlinear dynamic systems. Moreover, hybrid zonotopes can exactly represent the continuous piecewise linear control laws associated with model predictive control and the input-output mappings of neural networks with piecewise linear activation functions. The hybrid zonotope set representation is also highly exploitable, where efficient methods developed for mixed-integer linear programming can be directly used for set operation and analysis. The zonoLAB toolbox is designed to make these capabilities accessible to the dynamic systems and controls community, with functionality spanning fundamental operations with hybrid zonotope, constrained zonotope, and zonotope set representations, powerful set analysis tools, and general-purpose algorithms for reachability analysis of open- and closed-loop systems.
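The set operations mentioned in this abstract can be illustrated with a minimal Python sketch of a plain zonotope, $\{c + G\xi : \|\xi\|_\infty \le 1\}$. This is not zonoLAB's API (the toolbox is MATLAB-based), and it does not cover hybrid zonotopes, which add binary factors and equality constraints. The class name `Zonotope`, its methods, and the example matrices are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # row-major: matrix[row][col]


@dataclass
class Zonotope:
    """Set {c + G @ xi : every entry of xi lies in [-1, 1]}."""
    center: Vector
    generators: Matrix  # one row per state dimension, one column per generator

    def linear_map(self, m: Matrix) -> "Zonotope":
        """Image of the set under x -> m @ x (maps the center and each generator)."""
        n_gen = len(self.generators[0])
        new_center = [sum(a * c for a, c in zip(row, self.center)) for row in m]
        new_gens = [
            [sum(row[k] * self.generators[k][j] for k in range(len(row)))
             for j in range(n_gen)]
            for row in m
        ]
        return Zonotope(new_center, new_gens)

    def minkowski_sum(self, other: "Zonotope") -> "Zonotope":
        """Add the centers and concatenate the generator columns."""
        center = [a + b for a, b in zip(self.center, other.center)]
        gens = [ga + gb for ga, gb in zip(self.generators, other.generators)]
        return Zonotope(center, gens)

    def bounding_box(self) -> Tuple[Vector, Vector]:
        """Tightest axis-aligned box: center +/- row-wise sum of |generator entries|."""
        radii = [sum(abs(g) for g in row) for row in self.generators]
        lower = [c - r for c, r in zip(self.center, radii)]
        upper = [c + r for c, r in zip(self.center, radii)]
        return lower, upper


# One reachability step for x_next = A x + w (illustrative values only).
A = [[1.0, 1.0], [0.0, 1.0]]
X0 = Zonotope([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # unit box
W = Zonotope([0.0, 0.0], [[0.5], [0.25]])            # disturbance set
X1 = X0.linear_map(A).minkowski_sum(W)
lower, upper = X1.bounding_box()
```

Because linear maps and Minkowski sums act directly on the center and generators, one reachability step for a linear system stays inside the same set representation.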
- [164] arXiv:2311.09253 (replaced) [pdf, html, other]
-
Title: The Perception-Robustness Tradeoff in Deterministic Image Restoration
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
We study the behavior of deterministic methods for solving inverse problems in imaging. These methods are commonly designed to achieve two goals: (1) attaining high perceptual quality, and (2) generating reconstructions that are consistent with the measurements. We provide a rigorous proof that the better a predictor satisfies these two requirements, the larger its Lipschitz constant must be, regardless of the nature of the degradation involved. In particular, to approach perfect perceptual quality and perfect consistency, the Lipschitz constant of the model must grow to infinity. This implies that such methods are necessarily more susceptible to adversarial attacks. We demonstrate our theory on single image super-resolution algorithms, addressing both noisy and noiseless settings. We also show how this undesired behavior can be leveraged to explore the posterior distribution, thereby allowing the deterministic model to imitate stochastic methods.
- [165] arXiv:2312.13319 (replaced) [pdf, html, other]
-
Title: In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging
Comments: CVPR 2024
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Dual-Camera Compressed Hyperspectral Imaging (DCCHI) offers the capability to reconstruct a 3D Hyperspectral Image (HSI) by fusing compressive and Panchromatic (PAN) images, which has shown great potential for snapshot hyperspectral imaging in practice. In this paper, we introduce a novel DCCHI reconstruction network, the Intra-Inter Similarity Exploiting Transformer (In2SET). Our key insight is to make full use of the PAN image to assist the reconstruction. To this end, we propose using the intra-similarity within the PAN image as a proxy for approximating the intra-similarity in the original HSI, thereby offering an enhanced content prior for more accurate HSI reconstruction. Furthermore, we aim to align the features from the underlying HSI with those of the PAN image, maintaining semantic consistency and introducing new contextual information for the reconstruction process. By integrating In2SET into a PAN-guided unrolling framework, our method substantially enhances the spatial-spectral fidelity and detail of the reconstructed images, providing a more comprehensive and accurate depiction of the scene. Extensive experiments conducted on both real and simulated datasets demonstrate that our approach consistently outperforms existing state-of-the-art methods in terms of reconstruction quality and computational complexity. Code will be released.
- [166] arXiv:2401.03271 (replaced) [pdf, html, other]
-
Title: Analysis and Validation of Image Search Engines in Histopathology
Isaiah Lahr, Saghir Alfasly, Peyman Nejat, Jibran Khan, Luke Kottom, Vaishnavi Kumbhar, Areej Alsaafin, Abubakr Shafique, Sobhan Hemati, Ghazal Alabtah, Nneka Comfere, Dennis Murphee, Aaron Mangold, Saba Yasir, Chady Meroueh, Lisa Boardman, Vijay H. Shah, Joaquin J. Garcia, H.R. Tizhoosh
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Searching for similar images in archives of histology and histopathology images is a crucial task that may aid in patient matching for various purposes, ranging from triaging and diagnosis to prognosis and prediction. Whole slide images (WSIs) are highly detailed digital representations of tissue specimens mounted on glass slides. Matching WSI to WSI can serve as the critical method for patient matching. In this paper, we report extensive analysis and validation of four search methods: bag of visual words (BoVW), Yottixel, SISH, RetCCL, and some of their potential variants. We analyze their algorithms and structures and assess their performance. For this evaluation, we utilized four internal datasets ($1269$ patients) and three public datasets ($1207$ patients), totaling more than $200,000$ patches from $38$ different classes/subtypes across five primary sites. Certain search engines, for example, BoVW, exhibit notable efficiency and speed but suffer from low accuracy. Conversely, search engines like Yottixel demonstrate efficiency and speed, providing moderately accurate results. Recent proposals, including SISH, display inefficiency and yield inconsistent outcomes, while alternatives like RetCCL prove inadequate in both accuracy and efficiency. Further research is imperative to address the dual aspects of accuracy and minimal storage requirements in histopathological image search.
- [167] arXiv:2401.03689 (replaced) [pdf, html, other]
-
Title: LUPET: Incorporating Hierarchical Information Path into Multilingual ASR
Comments: Accepted by Interspeech 2024
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Toward high-performance multilingual automatic speech recognition (ASR), various types of linguistic information and model design have demonstrated their effectiveness independently. They include language identity (LID), phoneme information, language-specific processing modules, and cross-lingual self-supervised speech representation. It is expected that leveraging their benefits synergistically in a unified solution would further improve the overall system performance. This paper presents a novel design of a hierarchical information path, named LUPET, which sequentially encodes, from the shallow layers to deep layers, multiple aspects of linguistic and acoustic information at diverse granularity scales. The path starts from LID prediction, followed by acoustic unit discovery, phoneme sharing, and finally token recognition routed by a mixture-of-experts. ASR experiments are carried out on 10 languages in the Common Voice corpus. The results demonstrate the superior performance of LUPET as compared to the baseline systems. Most importantly, LUPET effectively mitigates the issue of performance compromise of high-resource languages with low-resource ones in the multilingual setting.
- [168] arXiv:2401.05452 (replaced) [pdf, html, other]
-
Title: Cuff-less Arterial Blood Pressure Waveform Synthesis from Single-site PPG using Transformer & Frequency-domain Learning
Muhammad Wasim Nawaz, Muhammad Ahmad Tahir, Ahsan Mehmood, Muhammad Mahboob Ur Rahman, Kashif Riaz, Qammer H. Abbasi
Comments: 8 pages, 3 figures, 2 tables, submitted for review and potential publication
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
We develop and evaluate two novel purpose-built deep learning (DL) models for synthesis of the arterial blood pressure (ABP) waveform in a cuff-less manner, using a single-site photoplethysmography (PPG) signal. We train and evaluate our DL models on the data of 209 subjects from the public UCI dataset on cuff-less blood pressure (CLBP) estimation. Our transformer model consists of an encoder-decoder pair that incorporates positional encoding, multi-head attention, layer normalization, and dropout techniques for ABP waveform synthesis. Secondly, under our frequency-domain (FD) learning approach, we first obtain the discrete cosine transform (DCT) coefficients of the PPG and ABP signals, and then learn a linear/non-linear (L/NL) regression between them. The transformer model (FD L/NL model) synthesizes the ABP waveform with a mean absolute error (MAE) of 3.01 (4.23). Further, the synthesis of ABP waveform also allows us to estimate the systolic blood pressure (SBP) and diastolic blood pressure (DBP) values. To this end, the transformer model reports an MAE of 3.77 mmHg and 2.69 mmHg, for SBP and DBP, respectively. On the other hand, the FD L/NL method reports an MAE of 4.37 mmHg and 3.91 mmHg, for SBP and DBP, respectively. Both methods fulfill the AAMI criterion. As for the BHS criterion, our transformer model (FD L/NL regression model) achieves grade A (grade B).
- [169] arXiv:2401.12345 (replaced) [pdf, html, other]
-
Title: Distributionally Robust Receive Beamforming
Subjects: Signal Processing (eess.SP)
This article investigates signal estimation in wireless transmission (i.e., receive beamforming) from the perspective of statistical machine learning, where the transmit signals may be from an integrated sensing and communication system; that is, 1) signals may be not only discrete constellation points but also arbitrary complex values; 2) signals may be spatially correlated. Particular attention is paid to handling various uncertainties such as the uncertainty of the transmit signal covariance, the uncertainty of the channel matrix, the uncertainty of the channel noise covariance, the existence of channel impulse noises, and the limited sample size of pilots. To proceed, a distributionally robust machine learning framework that is insensitive to the above uncertainties is proposed, which reveals that channel estimation is not a necessary operation. For optimal linear estimation, the proposed framework includes several existing beamformers as special cases such as diagonal loading and eigenvalue thresholding. For optimal nonlinear estimation, estimators are limited in reproducing kernel Hilbert spaces and neural network function spaces, and corresponding uncertainty-aware solutions (e.g., kernelized diagonal loading) are derived. In addition, we prove that the ridge and kernel ridge regression methods in machine learning are distributionally robust against diagonal perturbation in feature covariance.
- [170] arXiv:2402.03492 (replaced) [pdf, html, other]
-
Title: Beyond Strong labels: Weakly-supervised Learning Based on Gaussian Pseudo Labels for The Segmentation of Ellipse-like Vascular Structures in Non-contrast CTs
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Deep-learning-based automated segmentation of vascular structures in preoperative CT scans contributes to computer-assisted diagnosis and intervention procedure in vascular diseases. While CT angiography (CTA) is the common standard, non-contrast CT imaging is significant as a contrast-risk-free alternative, avoiding complications associated with contrast agents. However, the challenges of labor-intensive labeling and high labeling variability due to the ambiguity of vascular boundaries hinder conventional strong-label-based, fully-supervised learning in non-contrast CTs. This paper introduces a weakly-supervised framework using ellipses' topology in slices, including 1) an efficient annotation process based on predefined standards, 2) ellipse-fitting processing, 3) the generation of 2D Gaussian heatmaps serving as pseudo labels, 4) a training process through a combination of voxel reconstruction loss and distribution loss with the pseudo labels. We assess the effectiveness of the proposed method on one local and two public datasets comprising non-contrast CT scans, particularly focusing on the abdominal aorta. On the local dataset, our weakly-supervised learning approach based on pseudo labels outperforms strong-label-based fully-supervised learning (1.54\% of Dice score on average), reducing labeling time by around 82.0\%. The efficiency in generating pseudo labels allows the inclusion of label-agnostic external data in the training set, leading to an additional improvement in performance (2.74\% of Dice score on average) with a reduction of 66.3\% labeling time, where the labeling time remains considerably less than that of strong labels. On the public dataset, the pseudo labels achieve an overall improvement of 1.95\% in Dice score for 2D models while a reduction of 11.65 voxel spacing in Hausdorff distance for 3D model.
- [171] arXiv:2403.03489 (replaced) [pdf, html, other]
-
Title: Global Geolocated Realtime Data of Interfleet Urban Transit Bus Idling
Comments: 34 pages, 12 figures, 36 tables, 100 data sources (including links). Under Review at Nature Scientific Data
Subjects: Systems and Control (eess.SY); Computers and Society (cs.CY)
Urban transit bus idling is a contributor to ecological stress, economic inefficiency, and medically hazardous health outcomes due to emissions. The global accumulation of this frequent pattern of undesirable driving behavior is enormous. In order to measure its scale, we propose GRD-TRT-BUF-4I (Ground Truth Buffer for Idling), an extensible, realtime detection system that records the geolocation and idling duration of urban transit bus fleets internationally. Using live vehicle locations from General Transit Feed Specification (GTFS) Realtime, the system detects approximately 200,000 idling events per day from over 50 cities across North America, Europe, Oceania, and Asia. This realtime data was created to dynamically serve operational decision-making and fleet management to reduce the frequency and duration of idling events as they occur, as well as to capture its accumulative effects. Civil and Transportation Engineers, Urban Planners, Epidemiologists, Policymakers, and other stakeholders might find this useful for emissions modeling, traffic management, route planning, and other urban sustainability efforts at a variety of geographic and temporal scales.
- [172] arXiv:2403.03849 (replaced) [pdf, html, other]
-
Title: MedMamba: Vision Mamba for Medical Image Classification
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Since the era of deep learning, convolutional neural networks (CNNs) and vision transformers (ViTs) have been extensively studied and widely used in medical image classification tasks. Unfortunately, CNNs' limitations in modeling long-range dependencies result in poor classification performances. In contrast, ViTs are hampered by the quadratic computational complexity of their self-attention mechanism, making them difficult to deploy in real-world settings with limited computational resources. Recent studies have shown that state space models (SSMs) represented by Mamba can effectively model long-range dependencies while maintaining linear computational complexity. Inspired by it, we proposed MedMamba, the first vision Mamba for generalized medical image classification. Concretely, we introduced a novel hybrid basic block named SS-Conv-SSM, which integrates the convolutional layers for extracting local features with the abilities of SSM to capture long-range dependencies, aiming to model medical images from different image modalities efficiently. By employing the grouped convolution strategy and channel-shuffle operation, MedMamba successfully provides fewer model parameters and a lower computational burden for efficient applications. To demonstrate the potential of MedMamba, we conducted extensive experiments using 16 datasets containing ten imaging modalities and 411,007 images. Experimental results show that the proposed MedMamba demonstrates competitive performance in classifying various medical images compared with the state-of-the-art methods. Our work aims to establish a new baseline for medical image classification and provide valuable insights for developing more powerful SSM-based artificial intelligence algorithms and application systems in the medical field. The source codes and all pre-trained weights of MedMamba are available at this https URL.
- [173] arXiv:2403.11344 (replaced) [pdf, html, other]
-
Title: Bayesian multi-exposure image fusion for robust high dynamic range ptychography
Shantanu Kodgirwar, Lars Loetgering, Chang Liu, Aleena Joseph, Leona Licht, Daniel S. Penagos Molina, Wilhelm Eschen, Jan Rothhardt, Michael Habeck
Subjects: Image and Video Processing (eess.IV); Applications (stat.AP)
The limited dynamic range of the detector can impede coherent diffractive imaging (CDI) schemes from achieving diffraction-limited resolution. To overcome this limitation, a straightforward approach is to utilize high dynamic range (HDR) imaging through multi-exposure image fusion (MEF). This method involves capturing measurements at different exposure times, spanning from under- to overexposure, and fusing them into a single HDR image. The conventional MEF technique in ptychography typically involves subtracting the background noise, ignoring the saturated pixels, and then merging the acquisitions. However, this approach is inadequate under conditions of low signal-to-noise ratio (SNR). Additionally, variations in illumination intensity significantly affect the phase retrieval process. To address these issues, we propose a Bayesian MEF modeling approach based on a modified Poisson distribution that takes the background and saturation into account. To infer the model parameters, the expectation-maximization (EM) algorithm is employed. As demonstrated with synthetic and experimental data, our approach outperforms the conventional MEF method, offering superior phase retrieval under challenging experimental conditions. This work underscores the significance of robust multi-exposure image fusion for ptychography, particularly in imaging shot-noise-dominated weakly scattering specimens or in cases where access to HDR detectors with high SNR is limited. Furthermore, the applicability of the Bayesian MEF approach extends beyond CDI to any imaging scheme that requires HDR treatment. Given this versatility, we provide the implementation of our algorithm as a Python package.
- [174] arXiv:2403.15769 (replaced) [pdf, html, other]
-
Title: FusionINN: Decomposable Image Fusion for Brain Tumor Monitoring
Comments: Accepted at IJCAI Workshop 2024. Source code available at this https URL
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Image fusion typically employs non-invertible neural networks to merge multiple source images into a single fused image. However, for clinical experts, solely relying on fused images may be insufficient for making diagnostic decisions, as the fusion mechanism blends features from source images, thereby making it difficult to interpret the underlying tumor pathology. We introduce FusionINN, a novel decomposable image fusion framework, capable of efficiently generating fused images and also decomposing them back to the source images. FusionINN is designed to be bijective by including a latent image alongside the fused image, while ensuring minimal transfer of information from the source images to the latent representation. To the best of our knowledge, we are the first to investigate the decomposability of fused images, which is particularly crucial for life-sensitive applications such as medical image fusion compared to other tasks like multi-focus or multi-exposure image fusion. Our extensive experimentation validates FusionINN over existing discriminative and generative fusion methods, both subjectively and objectively. Moreover, compared to a recent denoising diffusion-based fusion model, our approach offers faster and qualitatively better fusion results.
- [175] arXiv:2403.17083 (replaced) [pdf, html, other]
-
Title: A Study in Dataset Pruning for Image Super-Resolution
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
In image Super-Resolution (SR), relying on large datasets for training is a double-edged sword. While offering rich training material, they also demand substantial computational and storage resources. In this work, we analyze dataset pruning to solve these challenges. We introduce a novel approach that reduces a dataset to a core-set of training samples, selected based on their loss values as determined by a simple pre-trained SR model. By focusing the training on just 50\% of the original dataset, specifically on the samples characterized by the highest loss values, we achieve results comparable to or surpassing those obtained from training on the entire dataset. Interestingly, our analysis reveals that the top 5\% of samples with the highest loss values negatively affect the training process. Excluding these samples and adjusting the selection to favor easier samples further enhances training outcomes. Our work opens new perspectives to the untapped potential of dataset pruning in image SR. It suggests that careful selection of training data based on loss-value metrics can lead to better SR models, challenging the conventional wisdom that more data inevitably leads to better performance.
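The loss-based selection described in this abstract might be sketched as follows. This is a minimal illustration and not the authors' implementation. The function name `select_core_set`, its parameters, and the toy loss values are hypothetical, and the exact rule for "favoring easier samples" is assumed here to be a simple shift of the selection window past the hardest samples.

```python
def select_core_set(losses, keep_fraction=0.5, drop_hardest_fraction=0.0):
    """Return indices of the samples to keep, ranked by descending loss.

    losses: per-sample loss values from a pre-trained SR model (assumed given).
    keep_fraction: share of the dataset to keep (0.5 mirrors the 50% in the text).
    drop_hardest_fraction: share of highest-loss samples skipped before selecting
        (0.05 would mirror the top 5% mentioned in the text).
    """
    n = len(losses)
    # Rank sample indices from highest loss to lowest loss.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    n_drop = int(n * drop_hardest_fraction)
    n_keep = int(n * keep_fraction)
    # Skip the hardest samples, then take the next n_keep samples by loss.
    return order[n_drop:n_drop + n_keep]


# Toy loss values, made up purely for illustration.
losses = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
hardest_half = select_core_set(losses)
shifted_half = select_core_set(losses, keep_fraction=0.5, drop_hardest_fraction=0.1)
```

With `drop_hardest_fraction=0.0` the function keeps the highest-loss half. A nonzero value first discards the hardest samples and then fills the same budget with slightly easier ones.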
- [176] arXiv:2403.17864 (replaced) [pdf, html, other]
-
Title: Synthetic training set generation using text-to-audio models for environmental sound classification
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)
In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper aims to investigate these aspects, specifically focusing on the task of classification of environmental sounds. This study analyzes the performance of two different environmental classification systems when data generated from text-to-audio models is used for training. Two cases are considered: a) when the training dataset is augmented by data coming from two different text-to-audio models; and b) when the training dataset consists solely of synthetic audio generated. In both cases, the performance of the classification task is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas the performance of the models drops when relying on only generated audio.
- [177] arXiv:2403.18087 (replaced) [pdf, html, other]
-
Title: Channel Estimation and Beamforming for Beyond Diagonal Reconfigurable Intelligent Surfaces
Comments: 14 pages, 10 figures, submitted to IEEE journal
Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
Beyond diagonal reconfigurable intelligent surface (BD-RIS) is a new advance and generalization of the RIS technique. BD-RIS breaks through the isolation between RIS elements by creatively introducing inter-element connections, thereby enabling smarter wave manipulation and enlarging coverage. However, exploring proper channel estimation schemes suitable for BD-RIS aided communication systems remains an open problem. In this paper, we study channel estimation and beamforming design for BD-RIS aided multi-antenna systems. We first describe the channel estimation strategy based on the least square (LS) method, derive the mean square error (MSE) of the LS estimation, and formulate the joint pilot sequence and BD-RIS design problem with unique constraints induced by BD-RIS architectures. Specifically, we propose an efficient pilot sequence and BD-RIS design which is theoretically guaranteed to achieve the minimum MSE. With the estimated channel, we then consider two BD-RIS scenarios and propose beamforming design algorithms. Finally, we provide simulation results to verify the effectiveness of the proposed channel estimation scheme and beamforming design algorithms. We also show that more inter-element connections in BD-RIS improve the performance while increasing the training overhead for channel estimation.
- [178] arXiv:2404.10911 (replaced) [pdf, html, other]
-
Title: Efficient Batch and Recursive Least Squares for Matrix Parameter Estimation
Comments: Accepted to the IEEE Control Systems Letters
Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
Traditionally, batch least squares (BLS) and recursive least squares (RLS) are used for identification of a vector of parameters that form a linear model. In some situations, however, it is of interest to identify parameters in a matrix structure. In this case, a common approach is to transform the problem into standard vector form using the vectorization (vec) operator and the Kronecker product, known as vec-permutation. However, the use of the Kronecker product introduces extraneous zero terms in the regressor, resulting in unnecessary additional computational and space requirements. This work derives matrix BLS and RLS formulations which, under mild assumptions, minimize the same cost as the vec-permutation approach. This new approach has lower computational and space complexity than vec-permutation in both BLS and RLS identification. It is also shown that persistent excitation guarantees convergence to the true matrix parameters. This method can be used to improve computation time in the online identification of multiple-input, multiple-output systems for indirect adaptive model predictive control.
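To illustrate why the Kronecker-product route carries extra zeros, the sketch below compares the two batch formulations on a toy problem. It assumes a simple unregularized model Y = Phi @ Theta, which may differ from the model and cost used in the paper. The function names `matrix_bls` and `vec_bls` are hypothetical, and only the batch case is shown.

```python
import numpy as np

def matrix_bls(Phi, Y):
    """Least squares on the matrix form: minimize ||Y - Phi @ Theta||_F^2."""
    Theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return Theta

def vec_bls(Phi, Y):
    """Same problem after vectorization: vec(Y) = (I_m kron Phi) vec(Theta)."""
    n, m = Phi.shape[1], Y.shape[1]
    big_regressor = np.kron(np.eye(m), Phi)  # (N*m) x (n*m), block diagonal
    y_vec = Y.reshape(-1, order="F")         # column-stacking vec operator
    theta_vec, *_ = np.linalg.lstsq(big_regressor, y_vec, rcond=None)
    return theta_vec.reshape(n, m, order="F")

# Toy data: N = 20 samples, n = 3 regressors, m = 2 output columns.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 3))
Theta_true = rng.standard_normal((3, 2))
Y = Phi @ Theta_true

Theta_matrix = matrix_bls(Phi, Y)
Theta_vec = vec_bls(Phi, Y)
zero_share = np.mean(np.kron(np.eye(2), Phi) == 0.0)  # fraction of zero entries
```

In this toy setting both routes recover the same estimate, but the vectorized regressor has m times as many rows and columns as `Phi`, and its off-diagonal blocks are entirely zero.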
- [179] arXiv:2405.13653 (replaced) [pdf, html, other]
-
Title: Downlink Power Control based UE-Sided Initial Access for Tactical 5G NR
Comments: Submitted to IEEE MILCOM 2024
Subjects: Signal Processing (eess.SP)
Communication technologies play a crucial role in battlefields. They are an inalienable part of any tactical response, whether at the battlefront or inland. Such scenarios require that the communication technologies be versatile, scalable, cost-effective, and stealthy. While multiple studies and past products have tried to address these requirements, none of them have been able to solve all four challenges simultaneously. Hence, in this paper, we propose a tactical solution that is based on the versatile, scalable, and cost-effective 5G NR system. Our focus is on the initial-access phase which is subject to a high probability of detection by an eavesdropper. To address this issue, we propose some modifications to how the user equipment (UE) performs initial access that lower the probability of detection while not affecting standards compliance and not requiring any modifications to the UE chipset implementation. Further, we demonstrate that with a simple downlink power control algorithm, we reduce the probability of detection at an eavesdropper. The result is a 5G NR based initial-access method that improves stealthiness when compared with a vanilla 5G NR implementation.
- [180] arXiv:2405.16543 (replaced) [pdf, html, other]
-
Title: Periodic Scenario Trees: A Novel Framework for Robust Periodic Invariance and Stabilization of Constrained Uncertain Linear Systems
Subjects: Systems and Control (eess.SY)
This work proposes a new framework for determining robust periodic invariant sets and their associated control laws for constrained uncertain linear systems. Necessary and sufficient conditions for stabilizability by periodic controllers are stated and proven using finite step Lyapunov functions for the unconstrained case. We then introduce a scenario tree interpretation of finite step Lyapunov functions for uncertain systems and show that this interpretation results in useful criteria for the design of robust stabilizing controllers. In particular, novel convex feasibility criteria for the synthesis of simple static controllers and what we call linear interpolating tree periodic controllers with memory are derived. It is proven that for a sufficiently large length of the period, a stabilizing linear interpolating tree periodic controller can always be found using the proposed criterion provided that the uncertain system is stabilizable by such controllers. In this sense, the presented synthesis method is non-conservative. The results are then extended to constrained uncertain linear systems, and conditions are derived for controllers that realize robust periodic invariant sets which are less conservative than those that result from the known methods in the literature.
- [181] arXiv:2405.17654 (replaced) [pdf, other]
-
Title: Data-Driven Personalized Energy Consumption Range Estimation for Plug-in Hybrid Electric Vehicles in Urban Traffic
Mehmet Fatih Ozkan, James Farrell, Marcello Telloni, Luis Mendez, Radu Pirvan, Jeffrey P. Chrstos, Marcello Canova, Stephanie Stockar
Subjects: Systems and Control (eess.SY)
In urban traffic environments, driver behaviors exhibit considerable diversity in vehicle operation, encompassing a range of acceleration and braking maneuvers as well as adherence to traffic regulations, such as speed limits. It is well-established that these intrinsic driving behaviors significantly influence vehicle energy consumption. Therefore, establishing a quantitative relationship between driver behavior and energy usage is essential for identifying energy-efficient driving practices and optimizing routes within urban traffic. This study introduces a data-driven approach to predict the equivalent fuel consumption of a plug-in hybrid electric vehicle (PHEV) based on an integrated model of driver behavior and vehicle energy consumption. Unlike traditional models that provide point predictions of fuel consumption, this approach uses Conformalized Quantile Regression (CQR) to offer prediction intervals that capture the variability and uncertainty in fuel consumption. These intervals reflect changes in fuel consumption, as well as variations in driver behavior, and vehicle and route conditions. To develop this model, driver-specific data were collected through a driver-in-the-loop simulator, which tested different human drivers' responses. The CQR model was then trained and validated using the experimental data from the driver-in-the-loop simulator, augmented by the synthetic data generated from Monte Carlo simulations conducted using a calibrated microscopic driver behavior and vehicle energy model. The CQR model was evaluated by comparing its predictions of equivalent fuel consumption intervals with those of baseline prediction interval methods that rely solely on conformal prediction.
- [182] arXiv:2406.02077 (replaced) [pdf, html, other]
-
Title: Multi-target stain normalization for histology slides
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Traditional staining normalization approaches, e.g. Macenko, typically rely on the choice of a single representative reference image, which may not adequately account for the diverse staining patterns of datasets collected in practical scenarios. In this study, we introduce a novel approach that leverages multiple reference images to enhance robustness against stain variation. Our method is parameter-free and can be adopted in existing computational pathology pipelines with no significant changes. We evaluate the effectiveness of our method through experiments using a deep-learning pipeline for automatic nuclei segmentation on colorectal images. Our results show that by leveraging multiple reference images, better results can be achieved when generalizing to external data, where the staining can widely differ from the training set.
- [183] arXiv:2406.02859 (replaced) [pdf, other]
-
Title: ConPCO: Preserving Phoneme Characteristics for Automatic Pronunciation Assessment Leveraging Contrastive Ordinal Regularization
Bi-Cheng Yan, Wei-Cheng Chao, Jiun-Ting Li, Yi-Cheng Wang, Hsin-Wei Wang, Meng-Shin Lin, Berlin Chen
Comments: This paper has been withdrawn because the authors aim to achieve better organization in writing and more detailed experimental analysis
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Automatic pronunciation assessment (APA) manages to evaluate the pronunciation proficiency of a second language (L2) learner in a target language. Existing efforts typically draw on regression models for proficiency score prediction, where the models are trained to estimate target values without explicitly accounting for phoneme-awareness in the feature space. In this paper, we propose a contrastive phonemic ordinal regularizer (ConPCO) tailored for regression-based APA models to generate more phoneme-discriminative features while considering the ordinal relationships among the regression targets. The proposed ConPCO first aligns the phoneme representations of an APA model and textual embeddings of phonetic transcriptions via contrastive learning. Afterward, the phoneme characteristics are retained by regulating the distances between inter- and intra-phoneme categories in the feature space while allowing for the ordinal relationships among the output targets. We further design and develop a hierarchical APA model to evaluate the effectiveness of our method. Extensive experiments conducted on the speechocean762 benchmark dataset suggest the feasibility and efficacy of our approach in relation to some cutting-edge baselines.
- [184] arXiv:2406.03111 (replaced) [pdf, html, other]
-
Title: Singing Voice Graph Modeling for SingFake Detection
Comments: Accepted by Interspeech 2024; Our code is available at this https URL
Subjects: Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Detecting singing voice deepfakes, or SingFake, involves determining the authenticity and copyright of a singing voice. Existing models for speech deepfake detection have struggled to adapt to unseen attacks in this unique singing voice domain of human vocalization. To bridge the gap, we present a groundbreaking SingGraph model. The model synergizes the capabilities of the MERT acoustic music understanding model for pitch and rhythm analysis with the wav2vec2.0 model for linguistic analysis of lyrics. Additionally, we advocate for using RawBoost and beat matching techniques grounded in music domain knowledge for singing voice augmentation, thereby enhancing SingFake detection performance. Our proposed method achieves new state-of-the-art (SOTA) results within the SingFake dataset, surpassing the previous SOTA model across three distinct scenarios: it improves EER relatively for seen singers by 13.2%, for unseen singers by 24.3%, and unseen singers using different codecs by 37.1%.
- [185] arXiv:2101.00337 (replaced) [pdf, html, other]
-
Title: Biologically Inspired Hexagonal Deep Learning for Hexagonal Image Generation
Comments: Accepted for: 2020 27th IEEE International Conference on Image Processing (ICIP). arXiv admin note: text overlap with arXiv:1911.11251
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Whereas conventional state-of-the-art image processing systems of recording and output devices almost exclusively utilize square arranged methods, biological models suggest an alternative, evolutionarily-based structure. Inspired by the human visual perception system, hexagonal image processing in the context of machine learning offers a number of key advantages that can benefit both researchers and users alike. The hexagonal deep learning framework Hexnet leveraged in this contribution therefore serves the generation of hexagonal images by utilizing hexagonal deep neural networks (H-DNN). As the results of our created test environment show, the proposed models can surpass current approaches of conventional image generation. While resulting in a reduction of the models' complexity in the form of trainable parameters, they furthermore allow an increase of test rates in comparison to their square counterparts.
- [186] arXiv:2208.06132 (replaced) [pdf, html, other]
-
Title: On the Physical Layer Security of Visible Light Communications Empowered by Gold Nanoparticles
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Visible light is a proper spectrum for secure wireless communications because of its high directivity and impermeability in indoor scenarios. However, if an eavesdropper is located very close to a legitimate receiver, secure communications become highly risky. In this paper, to further increase the level of security of visible light communication (VLC) and increase its resilience against malicious attacks, we propose to capitalize on the recently synthesized gold nanoparticles (GNPs) with chiroptical properties for circularly polarized light, resulting in the phase retardation that interacts with the linear polarizer angle. GNP plates made by judiciously stacking many GNPs perform as physical secret keys. Transmitters send both the intended symbol and artificial noise to exploit the channel variation effect by the GNP plates, which is highly effective when an eavesdropper is located close to the legitimate receiver. A new VLC channel model is first developed by representing the effect of GNP plates and linear polarizers in the circular polarization domain. Based on the new channel model, the angles of linear polarizers at the transmitters and legitimate receiver are optimized considering the effect of GNP plates to increase the secrecy rate in wiretapping scenarios. Simulations verify that when the transmitters are equipped with GNP plates, even if the eavesdropper is located right next to the legitimate receiver, insightful results on the physical layer security metrics are gained as follows: 1) the secrecy rate is significantly improved and 2) the symbol error rate gap between the legitimate receiver and eavesdropper becomes much larger due to the chiroptical properties of GNP plates.
- [187] arXiv:2210.00931 (replaced) [pdf, html, other]
-
Title: A New Class of Path-Following Method for Time-Varying Optimization with Optimal Parametric Function
Subjects: Optimization and Control (math.OC); Signal Processing (eess.SP)
In this paper, we consider a formulation of nonlinear constrained optimization problems. We reformulate it as a time-varying optimization using continuous-time parametric functions and derive a dynamical system for tracking the optimal solution. We then re-parameterize the dynamical system to express it based on a linear combination of the parametric functions. Calculus of variations is applied to optimize the parametric functions, so that the optimality distance of the solution is minimized. Accordingly, an iterative dynamic algorithm, named OP-TVO, is devised to find the solution with an efficient convergence rate. We benchmark the performance of the proposed algorithm against the prediction-correction method (PCM) in terms of optimality and computational complexity. The results show that OP-TVO can compete with PCM for the optimization problem of interest, which indicates it can be a promising approach to replace PCM for some time-varying optimization problems. Furthermore, this work provides a novel paradigm for solving parametric dynamical systems.
- [188] arXiv:2212.03228 (replaced) [pdf, html, other]
-
Title: ISAACS: Iterative Soft Adversarial Actor-Critic for Safety
Comments: Accepted in 5th Annual Learning for Dynamics & Control Conference (L4DC), University of Pennsylvania
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
The deployment of robots in uncontrolled environments requires them to operate robustly under previously unseen scenarios, like irregular terrain and wind conditions. Unfortunately, while rigorous safety frameworks from robust optimal control theory scale poorly to high-dimensional nonlinear dynamics, control policies computed by more tractable "deep" methods lack guarantees and tend to exhibit little robustness to uncertain operating conditions. This work introduces a novel approach enabling scalable synthesis of robust safety-preserving controllers for robotic systems with general nonlinear dynamics subject to bounded modeling error by combining game-theoretic safety analysis with adversarial reinforcement learning in simulation. Following a soft actor-critic scheme, a safety-seeking fallback policy is co-trained with an adversarial "disturbance" agent that aims to invoke the worst-case realization of model error and training-to-deployment discrepancy allowed by the designer's uncertainty. While the learned control policy does not intrinsically guarantee safety, it is used to construct a real-time safety filter (or shield) with robust safety guarantees based on forward reachability rollouts. This shield can be used in conjunction with a safety-agnostic control policy, precluding any task-driven actions that could result in loss of safety. We evaluate our learning-based safety approach in a 5D race car simulator, compare the learned safety policy to the numerically obtained optimal solution, and empirically validate the robust safety guarantee of our proposed safety shield against worst-case model discrepancy.
- [189] arXiv:2304.05600 (replaced) [pdf, html, other]
-
Title: Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
Comments: Accepted to CVPR 2024
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this, we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks, differing only in speech, similarly to the same video. Our results, from a comprehensive set of experiments investigating different training strategies, show this general approach improves performance on a range of downstream auditory and audiovisual tasks, without majorly affecting linguistic task performance overall. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks.
- [190] arXiv:2304.07954 (replaced) [pdf, html, other]
-
Title: Velocity Obstacle for Polytopic Collision Avoidance for Distributed Multi-robot Systems
Comments: Accepted to IEEE Robotics and Automation Letters (RA-L) 2023, with open source repository released
Subjects: Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
Obstacle avoidance for multi-robot navigation with polytopic shapes is challenging. Existing works simplify the system dynamics or consider it as a convex or non-convex optimization problem with positive distance constraints between robots, which limits real-time performance and scalability. Additionally, generating collision-free behavior for polytopic-shaped robots is harder due to implicit and non-differentiable distance functions between polytopes. In this paper, we extend the velocity obstacle (VO) principle to polytopic-shaped robots and propose a novel approach to construct the VO as a function of vertex coordinates and other robots' states. Compared with existing work on obstacle avoidance between polytopic-shaped robots, our approach is much more computationally efficient, as the proposed construction of the VO between polytopes is optimization-free. Based on the VO representation for polytopic shapes, we then propose a navigation approach for distributed multi-robot systems. We validate our proposed VO representation and navigation approach in multiple challenging scenarios including large-scale randomized tests, and our approach outperforms the state of the art in many evaluation metrics, including completion rate, deadlock rate, and the average travel distance.
- [191] arXiv:2305.12284 (replaced) [pdf, html, other]
-
Title: Safely Learning Dynamical Systems
Comments: 49 pages. arXiv admin note: text overlap with arXiv:2011.12257
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
A fundamental challenge in learning an unknown dynamical system is to reduce model uncertainty by making measurements while maintaining safety. We formulate a mathematical definition of what it means to safely learn a dynamical system by sequentially deciding where to initialize trajectories. The state of the system must stay within a safety region for a horizon of $T$ time steps under the action of all dynamical systems that (i) belong to a given initial uncertainty set, and (ii) are consistent with information gathered so far.
First, we consider safely learning a linear dynamical system involving $n$ states. For the case $T=1$, we present an LP-based algorithm that either safely recovers the true dynamics from at most $n$ trajectories, or certifies that safe learning is impossible. For $T=2$, we give an SDP representation of the set of safe initial conditions and show that $\lceil n/2 \rceil$ trajectories generically suffice for safe learning. For $T = \infty$, we provide SDP-representable inner approximations of the set of safe initial conditions and show that one trajectory generically suffices for safe learning. We extend a number of our results to the cases where the initial uncertainty set contains sparse, low-rank, or permutation matrices, or when the system has a control input.
Second, we consider safely learning a general class of nonlinear dynamical systems. For the case $T=1$, we give an SOCP-based representation of the set of safe initial conditions. For $T=\infty$, we provide semidefinite representable inner approximations to the set of safe initial conditions. We show how one can safely collect trajectories and fit a polynomial model of the nonlinear dynamics that is consistent with the initial uncertainty set and best agrees with the observations. We also present some extensions to cases where the measurements are noisy or the dynamical system involves disturbances.
- [192] arXiv:2305.17834 (replaced) [pdf, html, other]
-
Title: Streaming Audio Transformers for Online Audio Tagging
Comments: Interspeech2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audio transformers (SAT) that combine the vision transformer (ViT) architecture with Transformer-XL-like chunk processing, enabling efficient processing of long-range audio signals. Our proposed SAT is benchmarked against other transformer-based SOTA methods, achieving significant improvements in terms of mean average precision (mAP) at a delay of 2s and 1s, while also exhibiting significantly lower memory usage and computational overhead. Checkpoints are publicly available at this https URL.
- [193] arXiv:2306.01891 (replaced) [pdf, html, other]
-
Title: DH-PTAM: A Deep Hybrid Stereo Events-Frames Parallel Tracking And Mapping System
Comments: Accepted for publication in the IEEE Transactions on Intelligent Vehicles
Journal-ref: Vol.0, 2024
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
This paper presents a robust approach for a visual parallel tracking and mapping (PTAM) system that excels in challenging environments. Our proposed method combines the strengths of heterogeneous multi-modal visual sensors, including stereo event-based and frame-based sensors, in a unified reference frame through a novel spatio-temporal synchronization of stereo visual frames and stereo event streams. We employ deep learning-based feature extraction and description for estimation to enhance robustness further. We also introduce an end-to-end parallel tracking and mapping optimization layer complemented by a simple loop-closure algorithm for efficient SLAM behavior. Through comprehensive experiments on both small-scale and large-scale real-world sequences of VECtor and TUM-VIE benchmarks, our proposed method (DH-PTAM) demonstrates superior performance in terms of robustness and accuracy in adverse conditions, especially in large-scale HDR scenarios. Our implementation's research-based Python API is publicly available on GitHub for further research and development: this https URL.
- [194] arXiv:2306.10236 (replaced) [pdf, html, other]
-
Title: Beat Pilot Tone (BPT): Simultaneous MR Imaging and RF Motion Sensing at Arbitrary Frequencies
Comments: Accepted by the journal Magnetic Resonance in Medicine, April 2024
Subjects: Medical Physics (physics.med-ph); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Purpose: To introduce a simple system exploitation with the potential to turn MRI scanners into general-purpose RF motion monitoring systems.
Methods: Inspired by Pilot Tone (PT), this work proposes Beat Pilot Tone (BPT), in which two or more RF tones at arbitrary frequencies are transmitted continuously during the scan. These tones create motion-modulated standing wave patterns that are sensed by the receiver coil array, incidentally mixed by intermodulation in the receiver chain, and digitized simultaneously with the MRI data. BPT can operate at almost any frequency as long as the intermodulation products lie within the bandwidth of the receivers. BPT's mechanism is explained in electromagnetic simulations and validated experimentally.
Results: Phantom and volunteer experiments over a range of transmit frequencies suggest that BPT may offer frequency-dependent sensitivity to motion. Using a semi-flexible body receiver array, BPT appears to sense cardiac-induced body vibrations at microwave frequencies (1.2 GHz and greater). At lower frequencies, it exhibits a similar cardiac signal shape to PT, likely due to blood volume changes.
Other volunteer experiments with respiratory, bulk, and head motion show that BPT can achieve greater sensitivity to motion than PT and greater separability between motion types. Basic multiple-input multiple-output (4x22 MIMO) operation with simultaneous PT and BPT in head motion is demonstrated using two transmit antennas and a 22-channel head-neck coil.
Conclusion: BPT may offer a rich source of motion information that is frequency-dependent, simultaneous, and complementary to PT and the MRI exam.
- [195] arXiv:2307.14281 (replaced) [pdf, html, other]
-
Title: Moments of Autocorrelation Demerit Factors of Binary Sequences
Comments: 41 pages
Subjects: Information Theory (cs.IT); Discrete Mathematics (cs.DM); Signal Processing (eess.SP); Combinatorics (math.CO); Probability (math.PR)
Sequences with low aperiodic autocorrelation are used in communications and remote sensing for synchronization and ranging. The autocorrelation demerit factor of a sequence is the sum of the squared magnitudes of its autocorrelation values at every nonzero shift when we normalize the sequence to have unit Euclidean length. The merit factor, introduced by Golay, is the reciprocal of the demerit factor. We consider the uniform probability measure on the $2^\ell$ binary sequences of length $\ell$ and investigate the distribution of the demerit factors of these sequences. Sarwate and Jedwab have respectively calculated the mean and variance of this distribution. We develop new combinatorial techniques to calculate the $p$th central moment of the demerit factor for binary sequences of length $\ell$. These techniques prove that for $p\geq 2$ and $\ell \geq 4$, all the central moments are strictly positive. For any given $p$, one may use the technique to obtain an exact formula for the $p$th central moment of the demerit factor as a function of the length $\ell$. Jedwab's formula for variance is confirmed by our technique with a short calculation, and we go beyond previous results by also deriving an exact formula for the skewness. A computer-assisted application of our method also obtains exact formulas for the kurtosis, which we report here, as well as the fifth central moment.
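The definition of the demerit factor given in this abstract can be made concrete with a short sketch. The helper names `demerit_factor` and `mean_demerit` are hypothetical, and the brute-force average over all $2^\ell$ sequences is only feasible for small $\ell$. It is meant as an illustration of the definition, not of the combinatorial techniques developed in the paper.

```python
import itertools
import numpy as np

def demerit_factor(seq):
    """Sum of squared aperiodic autocorrelations at all nonzero shifts,
    divided by the squared autocorrelation at shift zero (which is the
    same as normalizing the sequence to unit Euclidean length)."""
    a = np.asarray(seq, dtype=float)
    c = np.correlate(a, a, mode="full")  # shifts -(l-1), ..., 0, ..., l-1
    zero_shift = a.size - 1              # index of shift 0 in `c`
    sidelobes = np.delete(c, zero_shift)
    return float(np.sum(sidelobes ** 2) / c[zero_shift] ** 2)

def mean_demerit(length):
    """Average demerit factor over all 2**length sequences with +1/-1 entries."""
    values = [demerit_factor(s)
              for s in itertools.product((-1, 1), repeat=length)]
    return sum(values) / len(values)

# The length-13 Barker sequence has sidelobes of magnitude at most 1.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
barker_demerit = demerit_factor(barker13)
merit = 1.0 / barker_demerit             # Golay's merit factor
```

For the Barker sequence above, twelve of the nonzero shifts have autocorrelation of magnitude 1 and the rest are zero, so the demerit factor is 12/169 and the merit factor is about 14.08.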
- [196] arXiv:2311.10320 (replaced) [pdf, other]
-
Title: Learning transformer-based heterogeneously salient graph representation for multimodal remote sensing image classification
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Data collected by different modalities can provide a wealth of complementary information, such as hyperspectral image (HSI) to offer rich spectral-spatial properties, synthetic aperture radar (SAR) to provide structural information about the Earth's surface, and light detection and ranging (LiDAR) to cover altitude information about ground elevation. Therefore, a natural idea is to combine multimodal images for refined and accurate land-cover interpretation. Although many efforts have been made to achieve multi-source remote sensing image classification, three issues remain: 1) indiscriminate feature representation without sufficiently considering modal heterogeneity, 2) abundant features and complex computations associated with modeling long-range dependencies, and 3) overfitting phenomenon caused by sparsely labeled samples. To overcome the above barriers, a transformer-based heterogeneously salient graph representation (THSGR) approach is proposed in this paper. First, a multimodal heterogeneous graph encoder is presented to encode distinctively non-Euclidean structural features from heterogeneous data. Then, a self-attention-free multi-convolutional modulator is designed for effective and efficient long-term dependency modeling. Finally, a mean forward is put forward in order to avoid overfitting. Based on the above structures, the proposed model is able to break through modal gaps to obtain differentiated graph representation with competitive time cost, even for a small fraction of training samples. Experiments and analyses on three benchmark datasets with various state-of-the-art (SOTA) methods show the performance of the proposed approach.
- [197] arXiv:2311.12056 (replaced) [pdf, html, other]
-
Title: Kuro Siwo: 33 billion $m^2$ under the water. A global multi-temporal satellite dataset for rapid flood mapping
Authors: Nikolaos Ioannis Bountos, Maria Sdraka, Angelos Zavras, Ilektra Karasante, Andreas Karavias, Themistocles Herekakis, Angeliki Thanasou, Dimitrios Michail, Ioannis Papoutsis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Global floods, exacerbated by climate change, pose severe threats to human life, infrastructure, and the environment. Recent catastrophic events in Pakistan and New Zealand underscore the urgent need for precise flood mapping to guide restoration efforts, understand vulnerabilities, and prepare for future occurrences. While Synthetic Aperture Radar (SAR) remote sensing offers day-and-night, all-weather imaging capabilities, its application in deep learning for flood segmentation is limited by the lack of large annotated datasets. To address this, we introduce Kuro Siwo, a manually annotated multi-temporal dataset, spanning 43 flood events globally. Our dataset maps more than 338 billion $m^2$ of land, with 33 billion designated as either flooded areas or permanent water bodies. Kuro Siwo includes a highly processed product optimized for flood mapping based on SAR Ground Range Detected, and a primal SAR Single Look Complex product with minimal preprocessing, designed to promote research on the exploitation of both the phase and amplitude information and to offer maximum flexibility for downstream task preprocessing. To leverage advances in large scale self-supervised pretraining methods for remote sensing data, we augment Kuro Siwo with a large unlabeled set of SAR samples. Finally, we provide an extensive benchmark, namely BlackBench, offering strong baselines for a diverse set of flood events from Europe, America, Africa, Asia and Australia.
- [198] arXiv:2312.01053 (replaced) [pdf, html, other]
-
Title: End-to-End Speech-to-Text Translation: A Survey
Comments: 75 pages
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech-to-text (ST) translation is the task of converting speech signals in one language to text in another language. It finds application in various domains, such as hands-free communication, dictation, video lecture transcription, and translation, to name a few. Automatic Speech Recognition (ASR) and Machine Translation (MT) models play crucial roles in traditional ST translation, enabling the conversion of spoken language in its original form to written text and facilitating seamless cross-lingual communication. ASR recognizes spoken words, while MT translates the transcribed text into the target language. Such disintegrated models suffer from cascaded error propagation and high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models for ST translation. However, to our knowledge, there is no comprehensive review of existing works on E2E ST. The present survey, therefore, discusses the work in this direction. We aim to provide a comprehensive review of the models employed and the metrics and datasets used for ST tasks, and to outline challenges and future research directions with new insights. We believe this review will be helpful to researchers working on various applications of ST models.
- [199] arXiv:2312.15289 (replaced) [pdf, html, other]
-
Title: Fréchet Wavelet Distance: A Domain-Agnostic Metric for Image Generation
Authors: Lokesh Veeramacheneni (University of Bonn), Moritz Wolter (University of Bonn), Hildegard Kuehne (University of Bonn), Juergen Gall (University of Bonn)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Modern metrics for generative learning like Fréchet Inception Distance (FID) demonstrate impressive performance. However, they suffer from various shortcomings, like a bias towards specific generators and datasets. To address this problem, we propose the Fréchet Wavelet Distance (FWD) as a domain-agnostic metric based on the Wavelet Packet Transform ($W_p$). FWD provides a view across a broad spectrum of frequencies in high-resolution images while preserving both spatial and textural aspects. Specifically, we use $W_p$ to project generated and dataset images to packet coefficient space. We then compute the Fréchet distance on the resulting coefficients to evaluate the quality of a generator. This metric is general-purpose and dataset-domain agnostic, as it does not rely on any pre-trained network, while being more interpretable because of frequency band transparency. We conclude, from an extensive evaluation of a wide variety of generators across various datasets, that the proposed FWD generalizes and improves robustness to domain shift and various corruptions compared to other metrics.
- [200] arXiv:2401.05725 (replaced) [pdf, html, other]
-
Title: Energy-Efficient STAR-RIS Enhanced UAV-Enabled MEC Networks with Bi-Directional Task Offloading
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
This paper introduces a novel multi-user mobile edge computing (MEC) scheme facilitated by the simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) and the unmanned aerial vehicle (UAV). Unlike existing MEC approaches, the proposed scheme enables bidirectional offloading, allowing users to concurrently offload tasks to the MEC servers located at the ground base station (BS) and UAV with STAR-RIS support. Specifically, we formulate an optimization problem that maximizes the energy efficiency of the system while satisfying the quality of service (QoS) constraints by jointly optimizing the resource allocation, user scheduling, passive beamforming of the STAR-RIS, and the UAV trajectory. A block coordinate descent (BCD) iterative algorithm designed with Dinkelbach's algorithm and the successive convex approximation (SCA) technique is proposed to effectively handle the formulated non-convex optimization problem with significant coupling among variables. Simulation results indicate that the proposed STAR-RIS enhanced UAV-enabled MEC scheme possesses significant advantages in enhancing the system energy efficiency over other baseline schemes including the conventional RIS-aided scheme.
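Energy efficiency is a ratio (for example, rate over power), which is why Dinkelbach's algorithm appears in the abstract above. The following is a minimal sketch of Dinkelbach's iteration on a toy one-dimensional problem. It is not the paper's BCD/SCA scheme: the candidate grid and the `rate` and `total_power` models are hypothetical and chosen only for illustration.

```python
import math


def dinkelbach(candidates, numerator, denominator, tol=1e-9, max_iter=100):
    """Maximize numerator(x) / denominator(x) over a finite candidate set.

    Assumes denominator(x) > 0 for every candidate and numerator(x) >= 0
    for at least one candidate.
    """
    lam = 0.0
    best = candidates[0]
    for _ in range(max_iter):
        # Solve the parametric subproblem max_x f(x) - lam * g(x).
        best = max(candidates, key=lambda x: numerator(x) - lam * denominator(x))
        gap = numerator(best) - lam * denominator(best)
        # Update the ratio estimate from the current maximizer.
        lam = numerator(best) / denominator(best)
        if gap < tol:
            break
    return best, lam


# Hypothetical models: achievable rate and total consumed power
# as functions of the transmit power p.
def rate(p):
    return math.log2(1.0 + 10.0 * p)


def total_power(p):
    return p + 0.5  # transmit power plus an assumed fixed circuit power


power_grid = [0.01 * k for k in range(1, 501)]
best_power, best_efficiency = dinkelbach(power_grid, rate, total_power)
```

In a realistic setting the inner `max` would be replaced by a convex solver over continuous variables; the outer update of `lam` stays the same.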
- [201] arXiv:2402.04012 (replaced) [pdf, other]
-
Title: Quantized Approximately Orthogonal Recurrent Neural Networks
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP); Statistics Theory (math.ST)
In recent years, Orthogonal Recurrent Neural Networks (ORNNs) have gained popularity due to their ability to manage tasks involving long-term dependencies, such as the copy-task, and their linear complexity. However, existing ORNNs utilize full precision weights and activations, which prevents their deployment on compact devices. In this paper, we explore the quantization of the weight matrices in ORNNs, leading to Quantized approximately Orthogonal RNNs (QORNNs). The construction of such networks remained an open problem, acknowledged for its inherent instability. We propose and investigate two strategies to learn QORNNs by combining quantization-aware training (QAT) and orthogonal projections. We also study post-training quantization of the activations for pure integer computation of the recurrent loop. The most efficient models achieve results similar to state-of-the-art full-precision ORNN, LSTM and FastRNN on a variety of standard benchmarks, even with 4-bit quantization.
- [202] arXiv:2402.09508 (replaced) [pdf, html, other]
-
Title: Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Controllable music generation plays a vital role in human-AI music co-creation. While Large Language Models (LLMs) have shown promise in generating high-quality music, their focus on autoregressive generation limits their utility in music editing tasks. To address this gap, we propose a novel approach leveraging a parameter-efficient heterogeneous adapter combined with a masking training scheme. This approach enables autoregressive language models to seamlessly address music inpainting tasks. Additionally, our method integrates frame-level content-based controls, facilitating track-conditioned music refinement and score-conditioned music arrangement. We apply this method to fine-tune MusicGen, a leading autoregressive music generation model. Our experiments demonstrate promising results across multiple music editing tasks, offering more flexible controls for future AI-driven music editing tools. The source codes and a demo page showcasing our work are available at this https URL.
- [203] arXiv:2402.18859 (replaced) [pdf, other]
-
Title: Taking Second-life Batteries from Exhausted to Empowered using Experiments, Data Analysis, and Health Estimation
Comments: 16 pages, 8 figures
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
The reuse of retired electric vehicle batteries in grid energy storage offers environmental and economic benefits. This study concentrates on health monitoring algorithms for retired batteries deployed in grid storage. Over 15 months of testing, we collect, analyze, and publicize a dataset of second-life batteries, implementing a cycling protocol simulating grid energy storage load profiles within a 3-4 V voltage window. Four machine-learning-based health estimation models, relying on online-accessible features and initial capacity, are compared, with the selected model achieving a mean absolute percentage error below 2.3% on test data. Additionally, an adaptive online health estimation algorithm is proposed by integrating a clustering-based method, thus limiting estimation errors during online deployment. These results showcase the feasibility of repurposing retired batteries for second-life applications. Based on obtained data and power demand, these second-life batteries exhibit potential for over a decade of grid energy storage use.
- [204] arXiv:2403.04847 (replaced) [pdf, html, other]
-
Title: Solving Inverse Problems with Model Mismatch using Untrained Neural Networks within Model-based Architectures
Comments: Published in Transactions in Machine Learning Research (TMLR)
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Model-based deep learning methods such as loop unrolling (LU) and deep equilibrium model (DEQ) extensions offer outstanding performance in solving inverse problems (IP). These methods unroll the optimization iterations into a sequence of neural networks that in effect learn a regularization function from data. While these architectures are currently state-of-the-art in numerous applications, their success heavily relies on the accuracy of the forward model. This assumption can be limiting in many physical applications due to model simplifications or uncertainties in the apparatus. To address forward model mismatch, we introduce an untrained forward model residual block within the model-based architecture to match the data consistency in the measurement domain for each instance. We propose two variants in well-known model-based architectures (LU and DEQ) and prove convergence under mild conditions. Our approach offers a unified solution that is less parameter-sensitive, requires no additional data, and enables simultaneous fitting of the forward model and reconstruction in a single pass, benefiting both linear and nonlinear inverse problems. The experiments show significant quality improvement in removing artifacts and preserving details across three distinct applications, encompassing both linear and nonlinear inverse problems. Moreover, we highlight reconstruction effectiveness in intermediate steps and showcase robustness to random initialization of the residual block and a higher number of iterations during evaluation. Code is available at this https URL.
- [205] arXiv:2403.09151 (replaced) [pdf, html, other]
-
Title: MPC without Terminal Ingredients Tailored to the SEIR Compartmental Epidemic Model
Comments: 28 pages, 6 figures, preprint under peer review
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
We consider the SEIR compartmental epidemic model subject to state and input constraints (a cap on the proportion of infectious individuals and limits on the allowed social distancing and quarantining measures, respectively). We present a tailored model predictive control (MPC) scheme without terminal conditions. We rigorously show recursive feasibility and asymptotic convergence of the MPC closed loop to the continuum of disease-free equilibrium points for suitably designed quadratic running cost and a sufficiently long prediction horizon (forecast window). Moreover, we establish the viability kernel (a.k.a. the admissible set) as a domain of attraction of the continuum of equilibria.
- [206] arXiv:2404.03211 (replaced) [pdf, other]
-
Title: Convergence Conditions of Online Regularized Statistical Learning in Reproducing Kernel Hilbert Space With Non-Stationary Data
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
We study the convergence of recursive regularized learning algorithms in the reproducing kernel Hilbert space (RKHS) with dependent and non-stationary online data streams. Firstly, we study the mean square asymptotic stability of a class of random difference equations in RKHS, whose non-homogeneous terms are martingale difference sequences dependent on the homogeneous ones. Secondly, we introduce the concept of random Tikhonov regularization path, and show that if the regularization path is slowly time-varying in some sense, then the output of the algorithm is consistent with the regularization path in mean square. Furthermore, if the data streams also satisfy the RKHS persistence of excitation condition, i.e. there exists a fixed length of time period, such that the conditional expectation of the operators induced by the input data accumulated over every time period has a uniformly strictly positive compact lower bound in the sense of the operator order with respect to time, then the output of the algorithm is consistent with the unknown function in mean square. Finally, for the case with independent and non-identically distributed data streams, the algorithm achieves the mean square consistency provided the marginal probability measures induced by the input data are slowly time-varying and the average measure over each fixed-length time period has a uniformly strictly positive lower bound.
- [207] arXiv:2405.03393 (replaced) [pdf, html, other]
-
Title: On-site scale factor linearity calibration of MEMS triaxial gyroscopes
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
The calibration of MEMS triaxial gyroscopes is crucial for achieving precise attitude estimation for various wearable health monitoring applications. However, gyroscope calibration poses greater challenges compared to accelerometers and magnetometers. This paper introduces an efficient method for calibrating MEMS triaxial gyroscopes using only a servo motor, making it well-suited for field environments. The core strategy of the method exploits the fact that the dot product of the measured gravity and the rotational speed in a fixed frame remains constant. To eliminate the influence of rotating centrifugal force on the accelerometer, the accelerometer data is measured while stationary. The proposed calibration experiment scheme allows gyroscopic measurements when operating each axis at a specific rotation speed, making it easier to evaluate the linearity across a speed range constituted by a series of rotation speeds. Moreover, the classical least squares algorithm alone proves adequate for estimating the scale factor, notably streamlining the analysis of the calibration process. Extensive numerical simulations were conducted to analyze the proposed method's performance in calibrating a triaxial gyroscope model. Experimental validation was also carried out using a commercially available MEMS inertial measurement unit (LSM9DS1 from Arduino nano 33 BLE SENSE) and a servo motor capable of precise speed control. The experimental results demonstrate the efficacy of the proposed calibration approach.
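The abstract above notes that classical least squares suffices for estimating the scale factor across a series of rotation speeds. The following is a simplified single-axis sketch of that idea, fitting a straight line between reference speeds and gyroscope readings. It does not implement the paper's gravity dot-product constraint, and the function names and the nonlinearity measure used here are assumptions made for illustration.

```python
def fit_scale_factor(reference_rates, measured_rates):
    """Ordinary least-squares fit of measured = scale * reference + bias."""
    n = len(reference_rates)
    mean_x = sum(reference_rates) / n
    mean_y = sum(measured_rates) / n
    sxx = sum((x - mean_x) ** 2 for x in reference_rates)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(reference_rates, measured_rates))
    scale = sxy / sxx
    bias = mean_y - scale * mean_x
    return scale, bias


def nonlinearity(reference_rates, measured_rates):
    """Largest fit residual as a fraction of the fitted output span
    (an assumed linearity measure, not taken from the paper)."""
    scale, bias = fit_scale_factor(reference_rates, measured_rates)
    residuals = [y - (scale * x + bias)
                 for x, y in zip(reference_rates, measured_rates)]
    span = abs(scale) * (max(reference_rates) - min(reference_rates))
    return max(abs(r) for r in residuals) / span
```

Here `reference_rates` would hold the commanded servo speeds and `measured_rates` the averaged gyroscope output on one axis at each speed, both in the same units.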
- [208] arXiv:2405.04354 (replaced) [pdf, html, other]
-
Title: A transversality theorem for semi-algebraic sets with application to signal recovery from the second moment and cryo-EM
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP); Algebraic Geometry (math.AG)
Semi-algebraic priors are ubiquitous in signal processing and machine learning. Prevalent examples include a) linear models where the signal lies in a low-dimensional subspace; b) sparse models where the signal can be represented by only a few coefficients under a suitable basis; and c) a large family of neural network generative models. In this paper, we prove a transversality theorem for semi-algebraic sets in orthogonal or unitary representations of groups: with a suitable dimension bound, a generic translate of any semi-algebraic set is transverse to the orbits of the group action. This, in turn, implies that if a signal lies in a low-dimensional semi-algebraic set, then it can be recovered uniquely from measurements that separate orbits.
As an application, we consider the implications of the transversality theorem to the problem of recovering signals that are translated by random group actions from their second moment. As a special case, we discuss cryo-EM: a leading technology to constitute the spatial structure of biological molecules, which serves as our prime motivation. In particular, we derive explicit bounds for recovering a molecular structure from the second moment under a semi-algebraic prior and deduce information-theoretic implications. We also obtain information-theoretic bounds for three additional applications: factoring Gram matrices, multi-reference alignment, and phase retrieval. Finally, we deduce bounds for designing permutation invariant separators in machine learning.
- [209] arXiv:2405.06290 (replaced) [pdf, other]
-
Title: Path Planning and Motion Control for Accurate Positioning of Car-like Robots
Comments: The paper needs further revision to guarantee technical correctness and conciseness
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
This paper investigates the planning and control for accurate positioning of car-like robots. We propose a solution that integrates two modules: a motion planner, facilitated by the rapidly-exploring random tree algorithm and continuous-curvature (CC) steering technique, generates a CC trajectory as a reference; and a nonlinear model predictive controller (NMPC) regulates the robot to accurately track the reference trajectory. Based on the $\mu$-tangency conditions in prior art, we derive explicit existence conditions and develop associated computation methods for a special class of CC paths which not only admit the same driving patterns as Reeds-Shepp paths but also consist of cusp-free clothoid turns. Afterwards, we create an autonomous vehicle parking scenario where the NMPC endeavors to follow the reference trajectory. Feasibility and computational efficiency of the CC steering are validated by numerical simulation. CarSim-Simulink joint simulations statistically verify that, with exactly the same NMPC, the closed-loop system with CC trajectories as references substantially outperforms the case where Reeds-Shepp trajectories are used as references.
- [210] arXiv:2405.12728 (replaced) [pdf, html, other]
-
Title: Leveraging Neural Radiance Fields for Pose Estimation of an Unknown Space Object during Proximity Operations
Comments: Accepted at IEEE International Conference on Space Robotics 2024 (ISpaRo 2024)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
We address the estimation of the 6D pose of an unknown target spacecraft relative to a monocular camera, a key step towards the autonomous rendezvous and proximity operations required by future Active Debris Removal missions. We present a novel method that enables an "off-the-shelf" spacecraft pose estimator, which is supposed to know the target CAD model, to be applied to an unknown target. Our method relies on an in-the-wild NeRF, i.e., a Neural Radiance Field that employs learnable appearance embeddings to represent varying illumination conditions found in natural scenes. We train the NeRF model using a sparse collection of images that depict the target, and in turn generate a large dataset that is diverse both in terms of viewpoint and illumination. This dataset is then used to train the pose estimation network. We validate our method on the Hardware-In-the-Loop images of SPEED+ that emulate lighting conditions close to those encountered on orbit. We demonstrate that our method successfully enables the training of an off-the-shelf spacecraft pose estimation network from a sparse set of images. Furthermore, we show that a network trained using our method performs similarly to a model trained on synthetic images generated using the CAD model of the target.
- [211] arXiv:2405.18255 (replaced) [pdf, html, other]
-
Title: Channel Reciprocity Based Attack Detection for Securing UWB Ranging by Autoencoder
Subjects: Cryptography and Security (cs.CR); Social and Information Networks (cs.SI); Signal Processing (eess.SP)
A variety of ranging threats, represented by the Ghost Peak attack, have raised concerns regarding the security performance of Ultra-Wide Band (UWB) systems with the finalization of the IEEE 802.15.4z standard. Based on channel reciprocity, this paper proposes a low-complexity attack detection scheme that compares Channel Impulse Response (CIR) features of both ranging sides, utilizing an autoencoder capable of data compression and feature extraction. Taking the Ghost Peak attack as an example, this paper demonstrates the effectiveness, feasibility and generalizability of the proposed attack detection scheme through simulation and experimental validation. The proposed scheme achieves an attack detection success rate of over 99% and can be implemented in current systems at low cost.
- [212] arXiv:2406.02178 (replaced) [pdf, html, other]
-
Title: Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations
Comments: Accepted at INTERSPEECH 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations from randomly masked spectrogram patches through self-supervision. Empirical results on ten diverse audio recognition downstream tasks show that the proposed models, pretrained on the AudioSet dataset, consistently outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by a considerable margin and demonstrate better performance in dataset size, sequence length and model size comparisons.
- [213] arXiv:2406.02315 (replaced) [pdf, html, other]
-
Title: An Independence-promoting Loss for Music Generation with Language Models
Comments: Accepted to ICML 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Music generation schemes using language modeling rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens, so the decoding strategy used for token prediction must be adapted to account for multiple codebooks: either it should model the joint distribution over all codebooks, or fit the product of the codebook marginal distributions. Modelling the joint distribution requires a costly increase in the number of auto-regressive steps, while fitting the product of the marginals yields an inexact model unless the codebooks are mutually independent. In this work, we introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation. The proposed loss is a proxy for mutual information based on the maximum mean discrepancy principle, applied in reproducing kernel Hilbert spaces. Our criterion is simple to implement and train, and it is generalizable to other multi-stream codecs. We show that it reduces the statistical dependence between codebooks during auto-encoding. This leads to an increase in the generated music quality when modelling the product of the marginal distributions, while generating audio much faster than the joint distribution model.
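The abstract above describes a dependence measure built on the maximum mean discrepancy (MMD). The following is a minimal sketch of one common way to turn MMD into such a measure: compare samples from the joint distribution of two streams with samples from the product of their marginals, obtained here by shuffling one stream. This is an assumption about the general principle, not the paper's loss; the Gaussian kernel, the bandwidth, and all function names are illustrative.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def dependence_proxy(stream_a, stream_b, bandwidth=1.0, seed=0):
    """MMD between paired samples and samples with stream_b shuffled.

    Shuffling breaks the pairing, so the shuffled pairs approximate
    the product of the two marginal distributions.
    """
    rng = random.Random(seed)
    joint = list(zip(stream_a, stream_b))
    shuffled_b = list(stream_b)
    rng.shuffle(shuffled_b)
    product_of_marginals = list(zip(stream_a, shuffled_b))
    return mmd_squared(joint, product_of_marginals, bandwidth)
```

A value near zero suggests the two streams are close to independent under the chosen kernel, and larger values suggest dependence. The estimator costs quadratic time in the number of samples, so a training loss would typically apply it per mini-batch.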
- [214] arXiv:2406.02897 (replaced) [pdf, html, other]
-
Title: LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Prior works have demonstrated zero-shot text-to-speech by using a generative language model on audio tokens obtained via a neural audio codec. It is still challenging, however, to adapt them to low-latency scenarios. In this paper, we present LiveSpeech - a fully autoregressive language model-based approach for zero-shot text-to-speech, enabling low-latency streaming of the output audio. To allow multiple token prediction within a single decoding step, we propose (1) using adaptive codebook loss weights that consider codebook contribution in each frame and focus on hard instances, and (2) grouping codebooks and processing groups in parallel. Experiments show our proposed models achieve competitive results to state-of-the-art baselines in terms of content accuracy, speaker similarity, audio quality, and inference speed while being suitable for low-latency streaming applications.
- [215] arXiv:2406.03240 (replaced) [pdf, html, other]
-
Title: Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy
Authors: Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonnan Cheng, Long Ye, Jianhua Tao
Comments: Accepted by INTERSPEECH 2024
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
With the proliferation of deepfake audio, there is an urgent need to investigate its attribution. Current source tracing methods can effectively distinguish in-distribution (ID) categories. However, the rapid evolution of deepfake algorithms poses a critical challenge in the accurate identification of out-of-distribution (OOD) novel deepfake algorithms. In this paper, we propose the Real Emphasis and Fake Dispersion (REFD) strategy for audio deepfake algorithm recognition, demonstrating its effectiveness in discriminating ID samples while identifying OOD samples. For effective OOD detection, we first explore current post-hoc OOD methods and propose NSD, a novel OOD approach for identifying novel deepfake algorithms that considers the similarity of both feature and logit scores. REFD achieves an 86.83% F1-score as a single system in Audio Deepfake Detection Challenge 2023 Track3, showcasing its state-of-the-art performance.
- [216] arXiv:2406.03247 (replaced) [pdf, html, other]
-
Title: Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection
Authors: Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen Shi
Comments: Accepted by INTERSPEECH 2024
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
The generalization of Fake Audio Detection (FAD) is critical due to the emergence of new spoofing techniques. Traditional FAD methods often focus solely on distinguishing between genuine and known spoofed audio. We propose a Genuine-Focused Learning (GFL) framework, called GFL-FAD, aiming for highly generalized FAD. This method incorporates a Counterfactual Reasoning Enhanced Representation (CRER) based on audio reconstruction using the Mask AutoEncoder (MAE) architecture to accurately model genuine audio features. To reduce the influence of spoofed audio during training, we introduce a genuine audio reconstruction loss, maintaining the focus on learning genuine data features. In addition, content-related bottleneck (BN) features are extracted from the MAE to supplement the knowledge of the original audio. These BN features are adaptively fused with CRER to further improve robustness. Our method achieves state-of-the-art performance with an EER of 0.25% on ASVspoof2019 LA.
- [217] arXiv:2406.03645 (replaced) [pdf, other]
-
Title: Partial Label Learning with Focal Loss for Sea Ice Classification Based on Ice Charts
Authors: Behzad Vahedi, Benjamin Lucas, Farnoush Banaei-Kashani, Andrew P. Barrett, Walter N. Meier, Siri Jodha Khalsa, Morteza Karimzadeh
Comments: Updated DOI and copyright info. Accepted for publication at the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Sea ice, crucial to the Arctic and Earth's climate, requires consistent monitoring and high-resolution mapping. Manual sea ice mapping, however, is time-consuming and subjective, prompting the need for automated deep learning-based classification approaches. However, training these algorithms is challenging because expert-generated ice charts, commonly used as training data, do not map single ice types but instead map polygons with multiple ice types. Moreover, the distribution of various ice types in these charts is frequently imbalanced, resulting in a performance bias towards the dominant class. In this paper, we present a novel GeoAI approach to training sea ice classification by formalizing it as a partial label learning task with explicit confidence scores to address multiple labels and class imbalance. We treat the polygon-level labels as candidate partial labels, assign the corresponding ice concentrations as confidence scores to each candidate label, and integrate them with focal loss to train a Convolutional Neural Network (CNN). Our proposed approach leads to enhanced performance for sea ice classification in Sentinel-1 dual-polarized SAR images, improving classification accuracy (from 87% to 92%) and weighted average F-1 score (from 90% to 93%) compared to the conventional training approach of using one-hot encoded labels and Categorical Cross-Entropy loss. It also improves the F-1 score in 4 out of the 6 sea ice classes.
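The abstract above combines candidate labels, per-label confidence scores, and focal loss. The following is a minimal per-pixel sketch of how such a combination could look, assuming the confidence scores simply weight the focal term of each candidate class. The exact formulation in the paper is not given in the abstract, so the function name, the weighting scheme, and the default `gamma` are assumptions.

```python
import math


def partial_label_focal_loss(probs, confidences, gamma=2.0, eps=1e-12):
    """Confidence-weighted focal loss for one sample.

    probs:        predicted class probabilities (summing to 1).
    confidences:  per-class confidence scores, e.g. ice concentrations;
                  zero for classes that are not candidate labels.
    gamma:        focusing parameter; gamma = 0 gives a weighted cross-entropy.
    """
    loss = 0.0
    for p, w in zip(probs, confidences):
        if w > 0.0:
            # (1 - p) ** gamma down-weights classes that are already well predicted.
            loss -= w * (1.0 - p) ** gamma * math.log(max(p, eps))
    return loss
```

With a one-hot `confidences` vector and `gamma=0.0`, this reduces to the categorical cross-entropy that the abstract names as the conventional baseline.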
- [218] arXiv:2406.04112 (replaced) [pdf, html, other]
-
Title: Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation
Comments: Accepted at ICML'24 (Oral)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Machine Learning (stat.ML)
While overparameterization in machine learning models offers great benefits in terms of optimization and generalization, it also leads to increased computational requirements as model sizes grow. In this work, we show that by leveraging the inherent low-dimensional structures of data and compressible dynamics within the model parameters, we can reap the benefits of overparameterization without the computational burdens. In practice, we demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models. Our approach is grounded in theoretical findings for deep overparameterized low-rank matrix recovery, where we show that the learning dynamics of each weight matrix are confined to an invariant low-dimensional subspace. Consequently, we can construct and train compact, highly compressed factorizations possessing the same benefits as their overparameterized counterparts. In the context of deep matrix completion, our technique substantially improves training efficiency while retaining the advantages of overparameterization. For language model fine-tuning, we propose a method called "Deep LoRA", which improves the existing low-rank adaptation (LoRA) technique, leading to reduced overfitting and a simplified hyperparameter setup, while maintaining comparable efficiency. We validate the effectiveness of Deep LoRA on natural language tasks, particularly when fine-tuning with limited data. Our code is available at this https URL.