AI-based Time-Series Regression

Sequence-based Learning – From Dealing with Noisy Data in Multivariate Learning Processes to Multi-level Time Series Forecasting

Tobias Feigl1,2
1Fraunhofer Institute of Integrated Circuits (IIS), Nuremberg, Germany
2Friedrich-Alexander-University Erlangen-Nuremberg (FAU), Germany

Keywords: Denoising, stochastic random walk noise, context vector capacity, recurrent neural network (RNN), long short-term memory (LSTM), gated recurrent unit (GRU), bidirectional LSTM (BLSTM), temporal convolutional network (TCN), time-delay neural network (TDNN), random forest (RF), transformer, multi-head attention, residual networks (ResNet), autoregressive method, seasonal ARIMA (SARIMA), Prophet, time series classification, time series regression, time series analysis, time series anomaly detection, dynamic time warping (DTW)

Figure 1: Exemplary processing chain for time series data with deep learning.

Research on time series analysis has gained momentum in recent years, as its findings can improve decision-making in industrial and scientific domains. Time series analysis aims to describe patterns and developments that appear in data over time. Among its many useful applications, the classification, regression, forecasting and anomaly detection of points in time and events in sequences (time series) are particularly noteworthy, as they contribute important information, for example to corporate decision-making. In today’s information-driven world, industry and research generate countless numerical time series every day. Many applications, including biology, medicine, finance and industry, require high-dimensional time series. Dealing with such extensive data sets poses various new and interesting challenges.


The first models developed for analyzing time series were univariate models such as ARIMA, which are based on the principle of autoregression: historical observations are used to predict future values. However, in many areas, especially in digital signal processing and economics (see Figure 1), several correlated random variables must be analyzed at the same time. Since univariate forecasting approaches ignore potentially useful data from other time series in the same data set, multivariate models such as the Vector Auto-Regressive (VAR) model were developed to include such data in the analysis. Such models remain popular to this day and are used on their own or in combination with non-linear modeling by neural networks.
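
The autoregressive principle can be illustrated with a minimal sketch, assuming a plain AR(p) model fitted by ordinary least squares with NumPy. The function names `fit_ar` and `forecast_ar` and the synthetic series are hypothetical and not taken from the applications described here; a full ARIMA model additionally handles differencing and moving-average terms, and a VAR model would replace the scalar lags with lagged vectors.

```python
import numpy as np

def fit_ar(series, p):
    """Fit AR(p) coefficients (intercept first) by ordinary least squares."""
    series = np.asarray(series, dtype=float)
    # Each row holds the p observations preceding the target, newest first.
    rows = [series[t - p:t][::-1] for t in range(p, len(series))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    y = series[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def forecast_ar(series, coef, steps):
    """Roll the fitted model forward, feeding predictions back as inputs."""
    p = len(coef) - 1
    history = list(np.asarray(series, dtype=float)[-p:])
    out = []
    for _ in range(steps):
        lags = np.array(history[-p:][::-1])  # most recent value first
        nxt = coef[0] + coef[1:] @ lags
        out.append(nxt)
        history.append(nxt)
    return np.array(out)

# Synthetic AR(2) process: x_t = 0.6 x_{t-1} - 0.2 x_{t-2} + noise
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(2, len(x)):
    x[t] = 0.6 * x[t - 1] - 0.2 * x[t - 2] + rng.normal(scale=0.1)

coef = fit_ar(x, p=2)                      # [intercept, lag-1, lag-2]
prediction = forecast_ar(x, coef, steps=5)
```

On this synthetic series the estimated lag coefficients should land close to the generating values 0.6 and -0.2, which is a quick sanity check before applying the same routine to real data.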

Challenges in natural processes

Despite significant developments in multivariate analytical modeling, problems still arise with high-dimensional data because not all variables have a direct influence on the target variable. Forecasts become inaccurate when unrelated variables are taken into account, which is often the case in practical applications such as signal processing. The natural processes that we find in the applications produce data that are described by a multivariate stochastic process, which takes into account the relationships between the individual time series. However, multivariate stochastic processes cannot be described using naive approaches.

Figure 2: Exemplary processing chain for time series data with deep learning.

Applications such as “efficient search and representation of tracking data” process time series data, namely video material. The information it contains, the trajectories of the players and the ball, can appear random because these objects occlude one another (e.g., when players fight for the ball) and because the motion curves of a game scene are non-linear. Time series analysis can resolve these random occlusions by following the course of the scene over time.

Applications such as “data-driven localization” process time series data that are subject to long-term stochastic noise caused by the clocks used for system synchronization, multipath propagation, fading effects, temperature and the movement dynamics in the system. Temporal relationships in the data can help to remove this noise (so-called denoising) and thereby expose informative features. Furthermore, data-driven movement models, which analyze movements over time, increase the accuracy of the position prediction (so-called forecasting), which otherwise suffers from the greatly simplified description used by conventional model-driven filters.

Applications such as “Comprehensible AI for Automotive” process data from a complex signal processing chain and use the temporal sequence to identify connections between different sources of information and higher-level actions, e.g., to predict “achieve seconds”.

Applications such as “intelligent power electronics” and “AI-supported status and fault diagnosis radio systems” also process time series data from different signal processing chains. In both cases the data streams must be monitored over time to distinguish desired changes from abnormal ones. The course over time makes it possible to identify and localize sources of interference that cannot be identified from a single snapshot of the data.

Why use deep learning for time series analysis today when autoregressive methods exist and stochastic processes cannot be modeled?

Together with the application researchers, the underlying application-specific data were analyzed and suitable levels of abstraction were identified. The aim was to develop hybrid methods that, despite the natural stochastic noise and the complexity of the information, remove noise from the data using the non-linear function approximators of deep learning (DL) and then use downstream Bayesian methods to find anomalies or to regress or classify target variables.

In contrast to hand-crafted autoregressive models and Bayesian filters, DL methods offer promising possibilities for time series prediction: they learn temporal dependencies automatically, handle temporal structures such as trends and seasonality directly from the data, and optionally offer considerably more parameterization options. Because they can extract patterns from the input data over long periods of time, they are well suited for forecasting. They can therefore deal with large amounts of data, several complex variables, and multivariate, multi-level inputs and outputs. DL reduces the need for feature engineering, data scaling and stationary data, because the networks learn independently and extract the required features from the raw input data during training. Time series data can be very irregular and complex. DL methods make no strong assumptions about the underlying pattern in the data or about the mapping function, learn linear and nonlinear relationships, and are also more robust to noise in the input data and in the mapping function, which is quite common in time series data.

Research focus of the competence pillar

Various research priorities were identified together with different applications:

Application-specific model optimization. The main focus is on suitable data collection (“What information must be collected?”), data analysis (“Which level of abstraction is optimal for the problem and process at hand?”), data preprocessing (“How must the data be normalized and standardized to make meaningful predictions?”) and the derivation of the optimal features and architecture for a well-defined problem that is as atomic as possible, so that stochastic processes can be handled. In an initial analysis phase, optimal application-specific classification and regression methods are derived for target categories and variables, as well as methods for the identification, detection and forecasting of anomalies. It is also examined how the fusion of temporal, spatial, spectral and mixed information extractors affects the quality of the results. Another application-specific focus is the effect of the temporal architecture of neural networks, such as context vectors in long short-term memory (LSTM) cells and attention, on the traceability of long-term, short-term and future dependencies in continuous information.

Minimization of uncertainty in forecasting methods. Another focus is to reduce the uncertainty of the prediction method (“How can the error variance and the bias of the prediction be reduced as the complexity and dimension of the data increase?”). To this end, the competence pillar investigates, among other things, how Monte Carlo dropout affects model accuracy and uncertainty and how the two can be balanced, and examines the deep coupling of temporal neural networks with Bayesian methods for reliable prediction.

Superordinate role

Time series data are omnipresent in the overall project. Even where they are not directly obvious, from a methodological point of view it almost always makes sense to identify temporal relationships in the underlying data and information. Additional temporal correlations are often hidden in the data and should be exploited for the solution. The number of scientific contributions to the competence pillar shows that there is great interest in time-series-based learning methods in both the methodological and the application-centered research communities.

I. Find the right processing method for time series analysis

Know your data and optimize your architecture. The main objective in dealing with time series data is typically to identify the optimal method and its architecture (or function tree) and parameters for the continuous data streams at hand, and to answer questions such as “Which sliding window size contains the right amount of information to represent the problem?”, “When do the sequences contain mainly redundant information?” and “Which architecture and which model family has to be parameterized, and how?”. Selecting a suitable method for analyzing your own time series data therefore first requires typical data analysis steps, see Figure 2. Search methods can then find the architecture and parameters that deliver the best results for a specific application.
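
The sliding-window question can be made concrete with a minimal sketch, assuming the data arrive as a (time, features) NumPy array. The helper `sliding_windows` and its parameters `window`, `horizon` and `stride` are hypothetical names chosen for illustration; they are exactly the quantities a search method would vary.

```python
import numpy as np

def sliding_windows(data, window, horizon=1, stride=1):
    """Split a (time, features) array into input windows and future targets."""
    data = np.asarray(data, dtype=float)
    inputs, targets = [], []
    last_start = len(data) - window - horizon
    for start in range(0, last_start + 1, stride):
        inputs.append(data[start:start + window])
        targets.append(data[start + window:start + window + horizon])
    return np.stack(inputs), np.stack(targets)

# Toy multivariate series: 100 time steps, 3 channels.
series = np.arange(300, dtype=float).reshape(100, 3)
X, y = sliding_windows(series, window=10, horizon=2, stride=5)
# X has shape (windows, window, features); y has shape (windows, horizon, features).
```

A larger `stride` reduces the overlap between neighboring windows and is one simple way to probe whether consecutive sequences carry mostly redundant information.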

For example, many industrial applications investigate numerous methods from the recurrent neural network (RNN) family to understand their analysis quality and the effects of large amounts of data [F1]. The project “AI-based status and fault diagnosis radio systems” has developed a highly specialized processing chain for the automatic detection and prediction of transmission errors in wireless networks using neural networks with stacked long short-term memory (LSTM) cells. The “data-driven localization” project also developed a specific processing chain. In the process, higher-dimensional statistical and spectral features were identified that optimally reflect the signal characteristics of the loss-free signal. For this purpose, CNN and ResNet classification methods [F3, F4] were used to reliably identify signal information both from the dimension-reduced features and from the raw signals and to track it over time. Likewise, different architectures of RNN variants, e.g. gated recurrent units (GRU), LSTM and bidirectional LSTM, were evaluated in the project “Intelligent Power Electronics” to estimate whether current-voltage and stability signatures can be identified and detected using time-series-based methods, and whether these methods can ultimately be reused for prediction.

Sometimes the optimal architecture needs a simulation. To find suitable and efficient neural networks for searching game scenes in soccer matches, the application “Efficient search and representation of tracking data” used a simulator that, after suitable preprocessing, delivers time series data from which an optimal architecture can be derived. In addition, Deep Siamese Networks offer a highly scalable approach to searching unordered sets of movement trajectories in game scenes [F6].

Long-term dependencies in time series. A first study has shown that variants of RNNs can capture long-term dependencies when the embedding of the input sequence is configured optimally, but the context vectors fade with increasing complexity and length of the embedding, since a single vector has to capture the information of all cells in a multiplicative manner. Therefore, A3 “data-driven localization” examines how multiple context vectors can prevent the essential context information from fading and how residual architectures with attention cells can prevent vanishing gradients in very deep structures. Furthermore, the effect of additive computing steps with multiple context vectors is examined [F1].
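
The difference between a single context vector and an attention-based context can be sketched as follows. This is a generic scaled dot-product attention over encoder states in NumPy, not the architecture studied in the application; the random `hidden_states` stand in for the outputs of an RNN encoder, and all names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_context(hidden_states, query):
    """Scaled dot-product attention over all encoder states."""
    d = hidden_states.shape[1]
    scores = hidden_states @ query / np.sqrt(d)
    weights = softmax(scores)
    return weights @ hidden_states, weights

rng = np.random.default_rng(0)
T, d = 50, 8
hidden_states = rng.normal(size=(T, d))   # stand-in for RNN encoder outputs

# Variant 1: one context vector, the last hidden state only.
single_context = hidden_states[-1]

# Variant 2: a context vector recomputed per query from all time steps.
query = hidden_states[3]                  # a query resembling an early step
context, weights = attention_context(hidden_states, query)
```

With the single context vector, information from early time steps must survive every recurrent update. With attention, the weighted sum is additive over all time steps, so early states remain directly reachable.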

II. Dealing with anomalies in time series data

Supervised methods. Anomalies in time series often lead to undesirable behavior and often cannot be described in an error model because, for example, they occur very rarely or behave in a time-variant way. Therefore, one application examines the prediction of current, voltage and stability signatures for power electronics to identify, at an early stage, failures ranging from partial areas to entire circuits, or significant deviations in characteristic curves. However, continuous deviations, such as the aging of components, are difficult to capture. In supervised methods, anomaly detection is often framed as an end-to-end classification problem, where data from both classes (normal and abnormal) must be present to enable a distinction. Deep neural networks such as LSTMs and temporal CNNs are used to distinguish intelligently between normal and abnormal data points. Another application has identified anomalies in the channel information of radio systems, using CNNs to differentiate between signals with and without a direct line of sight in order to strengthen a downstream localization component. However, supervised methods are only suitable for time-invariant data with many abnormal data points; otherwise they overfit [F3].

Figure 3: Exemplary processing chain for anomaly detection in time series data.

Semi-supervised and unsupervised methods. In contrast to supervised methods, these methods can also detect time-varying abnormal behavior and rare abnormal events. The behavior of the normal data is modeled as well as possible so that long-term and rare deviations can be detected. Cluster-based algorithms use selected characteristics of the input data to delimit groups of normal data points and then identify deviating data points by their large distances from the clusters. In the area of deep neural networks, generative models such as the Variational Autoencoder (VAE) are often used to identify abnormal data: the networks learn the complex distribution of the normal data and can thus identify abnormal data points outside that distribution. The application “Data-Driven Localization” investigated a VAE model to identify and evaluate line-of-sight and non-line-of-sight connections in the propagation paths between transmitter and receiver units. The temporal relationships are controlled by embedding the input sequence [F3]. The VAE models were optimized in a large-scale parameter study [F2].
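
The cluster-based idea can be sketched in a few lines, assuming normal data only at training time. The example uses a very small k-means written in NumPy rather than a VAE; the two-dimensional points stand in for features extracted from time series windows, and the 99th-percentile threshold is an arbitrary assumption.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Very small k-means; returns the cluster centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):              # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def anomaly_scores(points, centroids):
    """Distance of each point to its nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1)

# Normal behavior only: two groups of feature vectors.
rng = np.random.default_rng(0)
normal = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(250, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(250, 2)),
])
centroids = kmeans(normal, k=2)
threshold = np.percentile(anomaly_scores(normal, centroids), 99)

# New observations: one close to normal behavior, one far away from it.
new_points = np.array([[0.2, -0.1], [10.0, -10.0]])
scores = anomaly_scores(new_points, centroids)
flags = scores > threshold                # True marks a suspected anomaly
```

Because only normal data are modeled, no abnormal examples are needed for training, which is the property that distinguishes this family from the supervised classifiers above.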

III. Generalizability and estimation of uncertainty in sequence-based forecasting methods

Generalizability. Methods that work with time series data typically suffer when the systems supplying the data, e.g. sensors or the system environment, change, so that the characteristic properties of the signals change in the short and long term. After such a change, an already trained model no longer provides a reliable forecast and must be adapted to the new signal properties. This adaptation can take place in different ways. In so-called domain adaptation, the source and target domains have the same feature space but different distributions, for example when the properties of sensor measurements change.

In contrast, transfer learning also covers cases in which the feature space of the target domain differs from that of the source domain, for example when input streams are recorded by different types of sensors. A3 “Data-Driven Localization” examines how neural networks that have been trained on a radio propagation model of one room can be transferred to the propagation model of another room. Since it is primarily the properties of the signals that change, research is being carried out into how the distributions of the two propagation models can be brought closer together [F3]. There are close links here to the competence pillars Few Label Learning and Active Learning.
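
One of the simplest ways to bring two distributions with the same feature space closer together is to match their first and second moments per feature. The following sketch is only a baseline under that assumption and is not the method investigated in the application; the function names and the synthetic “room A” and “room B” data are hypothetical.

```python
import numpy as np

def fit_moments(x):
    """Per-feature mean and standard deviation of a (samples, features) array."""
    return x.mean(axis=0), x.std(axis=0) + 1e-12

def align_to_source(target, source_stats, target_stats):
    """Map target-domain features onto the source mean and scale."""
    src_mean, src_std = source_stats
    tgt_mean, tgt_std = target_stats
    return (target - tgt_mean) / tgt_std * src_std + src_mean

rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, scale=1.0, size=(5000, 4))   # e.g. features from room A
target = rng.normal(loc=2.0, scale=3.0, size=(5000, 4))   # e.g. room B, shifted and scaled

aligned = align_to_source(target, fit_moments(source), fit_moments(target))
```

After alignment, a model trained on the source features sees target inputs with the same per-feature mean and scale. Differences in higher moments or in the correlations between features are not corrected by this baseline.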

Figure 4: Exemplary processing chain of a generalization using transfer learning of different time series data.

Uncertainty estimation. “What the neural network predicts is not necessarily what you see!” This means that the predictions of a neural network are subject to an uncertainty that is difficult to interpret. To evaluate the uncertainty of a trained model, Monte Carlo dropout interprets regular dropout as a Bayesian approximation to a well-known probabilistic model, the Gaussian process. Each so-called dropout mask can be interpreted as a separate network (with different neurons eliminated) and treated as a Monte Carlo sample from the space of all available models. This provides a mathematical foundation for interpreting the uncertainty of the model, which helps to improve prediction accuracy and robustness. Dropout is used in both training and testing. Many predictions are made, one from each sampled model, to analyze the average prediction and its distribution. Since this approach does not require any changes to the model architecture, the method can even be applied to a model that has already been trained. However, dropout often leads to unstable training. Therefore, the application “data-driven localization” examines how the best compromise between dropout, training stability and prediction accuracy can be achieved. Another application goes one step further and examines stochastic weight averaging with Gaussian processes, which can be implemented directly in the training process [F1], leads to more stable training and has lower computing costs.
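
The sampling step of Monte Carlo dropout can be sketched with NumPy alone. The network below is a stand-in with fixed random weights rather than a trained model, and the dropout rate, the number of samples and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained network: one hidden layer with fixed random weights.
W1, b1 = rng.normal(size=(4, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 1)) / 8.0, np.zeros(1)

def forward(x, drop_rate, rng):
    """One stochastic forward pass with dropout kept active at test time."""
    h = np.maximum(x @ W1 + b1, 0.0)                 # ReLU hidden layer
    mask = rng.random(h.shape) >= drop_rate          # sample a dropout mask
    h = h * mask / (1.0 - drop_rate)                 # inverted-dropout scaling
    return h @ W2 + b2

def mc_dropout_predict(x, n_samples=200, drop_rate=0.2, seed=1):
    """Mean and standard deviation over many dropout masks."""
    sample_rng = np.random.default_rng(seed)
    preds = np.stack([forward(x, drop_rate, sample_rng) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)

x = rng.normal(size=(5, 4))                          # five input samples
mean, std = mc_dropout_predict(x)
```

The mean over the sampled masks serves as the prediction and the standard deviation as a per-sample uncertainty estimate. Setting `drop_rate=0.0` collapses all samples to the same deterministic output, so the standard deviation vanishes.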

IV. Coupling of model-driven and data-driven processes

Learning the hidden latent structure of sequences is a well-studied topic in machine learning. However, popular and successful autoregressive models such as multi-layer perceptrons and RNNs fail to capture the true latent structure embedded in a sequence when the sequence is generated from a multimodal distribution. Therefore, the application “Data-driven localization” examines whether an autoregressive Conditional Variational Autoencoder (CVAE) can be used to learn Markov sequences of any order and whether it can be extended to a general Bayesian network. First experiments with synthetic sequences from trajectories show that the model successfully learns to generate Markov sequences with multimodal state transition probabilities [F5].

Figure 5: Architecture of a deep Bayes method, using the example of long short-term memory cells deeply coupled with a Kalman filter (process noise Q, measurement noise R and movement model f; the black box model transforms the dimensions of the raw time series data if necessary).

Another approach is to combine model-driven filter methods such as Kalman filters (KF) with data-driven RNN variants (so-called Deep Bayes), since each is limited when used on its own: they either generalize too much or not at all, or require a lot of effort, time and resources. Using several LSTM modules, the internal prediction, its uncertainty and the uncertainty of the observed measurement are estimated at each point in time. Similar to the original KF, an internal state is retained, which strengthens the robustness of the system against missing or implausible time series data. First experiments in the application “data-driven localization” have shown that an LSTM-KF model, as an example of an RNN-supported Bayesian method, is more robust against noisy labels than standalone variants of KF and LSTM [F1] (to be submitted).
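
The model-driven half of such a coupling can be sketched as a plain linear Kalman filter. In this sketch the movement model F and the noise matrices Q and R are fixed constants chosen for a synthetic constant-velocity track; in an LSTM-KF as described above they would instead be estimated by learned modules. All names and values are assumptions for illustration.

```python
import numpy as np

def kalman_filter(measurements, F, H, Q, R, x0, P0):
    """Linear Kalman filter; returns the filtered state at each time step."""
    x, P = x0.copy(), P0.copy()
    states = []
    for z in measurements:
        # Prediction with the movement model f(x) = F x and process noise Q.
        x = F @ x
        P = F @ P @ F.T + Q
        if z is not None:                 # skip the update if the measurement is missing
            # Update with the measurement and its noise R.
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ (np.atleast_1d(z) - H @ x)
            P = (np.eye(len(x)) - K @ H) @ P
        states.append(x.copy())
    return np.array(states)

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity movement model
H = np.array([[1.0, 0.0]])              # only the position is observed
Q = 1e-4 * np.eye(2)                    # process noise (fixed here)
R = np.array([[1.0]])                   # measurement noise (fixed here)

rng = np.random.default_rng(0)
true_pos = 0.5 * np.arange(200)
noisy = list(true_pos + rng.normal(scale=1.0, size=200))
noisy[50] = None                        # simulate a missing measurement

est = kalman_filter(noisy, F, H, Q, R, x0=np.zeros(2), P0=10.0 * np.eye(2))
# est[:, 0] holds the filtered position, est[:, 1] the estimated velocity.
```

The retained internal state is what lets the filter bridge the missing measurement at step 50: the prediction step still runs, and only the update is skipped.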



[F1] Feigl T., Porada A., Kram S., Stahlke M., Mutschler C. (2021). A Deep Bayes Localization Filter on Noisy Labels. IPIN. (work-in-progress).
[F2] Alstidl T., Hermann O., Kram S., Feigl T., Mutschler C. (2021). Accuracy-Aware Compression of Channel Impulse Responses using Deep Learning. IPIN. (work-in-progress).
[F3] Stahlke M., Feigl T., Kram S., Mutschler C. (2021). Channel quality estimation using a Variational Autoencoder. IPIN. (work-in-progress).
[F4] Kram S., Feigl T., Eberlein E., Franke N., Alawieh M., Nowak T., Mutschler C. (2021). On the Challenges and Opportunities of Positioning under Multipath Signal Propagation. IPIN. (work-in-progress).
[F5] Siddiqui H. R., Mutschler C. (2021). Learning Markov Chains Using Generative Models. AAAI. (work-in-progress).


[F6] Löffler C., Witt N., Mutschler C. (2020). Deep Siamese Metric Learning: A Highly Scalable Approach to Search Unordered Sets of Trajectories. J. ACM TIST (special issue).
[F7] Löffler C., Witt N., Mutschler C. (2020). Scene similarity. Conf. on Sport Info Tech.
[1] Feigl T., Kram S., Eberlein E., Mutschler C. (2020). Robust ToA-Estimation using Convolutional Neural Networks on Randomized Channel Models. IEEE Transactions on Signal Processing (submitted).
[2] Feigl T., Gruner L., Mutschler C., and Roth D. (2020). Real-Time Gait Reconstruction For Virtual Reality Using a Single Sensor. ISMAR.
[3] Redzepagic A., Löffler C., Feigl T., Mutschler C. (2020). A Sense of Quality for Augmented Reality Assisted Process Guidance. ISMAR.
[4] Feigl T., Kram S., Woller P., Siddiqui R. H., Philippsen M., Mutschler C. (2020). RNN-aided Human Velocity Estimation from a Single IMU. Sensors.
[5] Ott F., Feigl T., Löffler C., Mutschler C. (2020). ViPR: Visual-Odometry-aided Pose Regression for 6DoF Camera Localization. CVPR.
[6] Feigl T., Porada A., Steiner S., Löffler C., Mutschler C., Philippsen M. (2020). Localization Limitations of ARCore, ARKit, and Hololens in Dynamic Large-Scale Industry Environments. GRAPP.


[7] Ott F., Feigl T., Löffler C., Mutschler C. (2019). Visual-Odometry-aided Pose Regression for 6DoF Camera Localization. arxiv/cs.CV.
[8] Kram S., Stahlke M., Feigl T., Seitz J., Thielecke J. (2019). UWB Channel Impulse Responses for Positioning in Complex Environments: A Detailed Feature Analysis. Sensors.

[9] Feigl T., Kram S., Woller P., Siddiqui R. H., Philippsen M., Mutschler C. (2019). A Bidirectional LSTM for Estimating Dynamic Human Velocities from a Single IMU. IPIN.
[10] Feigl T., Roth D., Gradl S., Wirth M., Latoschik M. E., Eskofier B., Philippsen M., Mutschler C. (2019). Sick Moves! Motion Parameters as Indicators of Simulator Sickness. TVCG.
[11] Roth D., Westermeier F., Brübach L., Feigl T., Schell C., Latoschik M. E. (2019). Brain 2 Communicate: EEG-based Affect Recognition to Augment Virtual Social Interactions. GI VR/AR.
[12] Roth D., Brübach L., Westermeier F., Schell C., Feigl T., Latoschik M. E. (2019). A Social Interaction Interface Supporting Affective Augmentation Based on Neuronal Data. SUI.


[13] Feigl T., Nowak T., Philippsen M., Edelhäußer T., Mutschler C. (2018). Recurrent Neural Networks on Drifting Time-of-Flight Measurements. IPIN.
[14] Feigl T., Mutschler C., Philippsen M. (2018). Supervised Learning for Yaw Orientation Estimation. IPIN.
[15] Feigl T., Mutschler C., Philippsen M. (2018). Head-to-Body-Pose Classification in No-Pose VR Tracking Systems. IEEE VR.