Python: Constructing Time Series Sequence Samples Dataset

This post shows how to convert multivariate time series data into a dataset of sequence samples for RNNs, LSTMs, CNNs, and similar models in Keras or TensorFlow.



Time Series Sequence Samples Dataset



Sequence-based models such as LSTMs require a 3D dataset structure: (batch, timesteps, features). Here, 'batch' refers to the subset of samples included in a mini-batch during model training.

Timesteps denote the historical sequence of data instances, or temporal lags. Adding this temporal dimension produces a three-dimensional dataset that captures the sequential nature of the data.
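
As a rough sketch of this structure (not the general function developed in this post), the following stacks overlapping windows of a 2D array into a 3D array. The data values, the window length of 3, and the names `timesteps` and `X` are assumptions made for illustration.

```python
import numpy as np

# Illustrative data: 10 observations (rows) of 3 features (columns)
data = np.arange(30, dtype=float).reshape(10, 3)

timesteps = 3  # assumed window length, for illustration only

# Stack every run of `timesteps` consecutive rows into one sample
X = np.stack([data[i:i + timesteps] for i in range(len(data) - timesteps + 1)])

print(data.shape)  # (10, 3): (observations, features)
print(X.shape)     # (8, 3, 3): (samples, timesteps, features)
```

Each sample is a small (timesteps, features) matrix, and the first axis is what the model later splits into mini-batches.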


Python code


I've made a general function for this purpose:

f_make_seq_data_from_matrix(data, ts_list, fh_list)

'data' is a NumPy matrix (a 2D array) containing a multivariate time series, with one row per observation and one column per feature. 'ts_list' is a list of timesteps, which need not be consecutive. 'fh_list' is a list of forecasting horizons, which can describe single-step or multi-step forecasts and also need not be consecutive.

Because the function handles distributed lags as well as one-step and multi-step forecasting, I believe it is useful for general purposes.


import numpy as np
 
def f_make_seq_data_from_matrix(data, ts_list, fh_list):
    # Full window: from the earliest offset to the latest offset, inclusive
    co_list = ts_list + fh_list
    coseq_range = range(min(co_list), max(co_list) + 1)
    tsseq_range = range(min(ts_list), max(ts_list) + 1)
    fhseq_range = range(min(fh_list), max(fh_list) + 1)
    # Positions of the selected steps inside the input and output sub-windows
    tssel_list = [i - min(ts_list) for i in ts_list]
    fhsel_list = [i - min(fh_list) for i in fh_list]
    
    is_seq = []
    ot_seq = []
    obs = data.shape[0]
    # Slide the window over the rows of data, one observation at a time
    for i in range(obs - len(coseq_range) + 1):
        dal = data[i:i + len(coseq_range)]  # full window
        din = dal[:len(tsseq_range)]        # input part (leading rows)
        dot = dal[-len(fhseq_range):]       # output part (trailing rows)
        is_seq.append(din[tssel_list])
        ot_seq.append(dot[fhsel_list])
    return np.array(is_seq), np.array(ot_seq)
 


The elements of the two lists are offsets relative to time 't'. For example, -2, -1, 0, 1, 2 correspond to t-2, t-1, t, t+1, t+2, respectively.
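
To make the offset convention concrete, here is a small sketch that maps offsets to row indices of the data matrix. The helper `rows_for_sample` and the variable `base` are hypothetical names of my own, and the mapping assumes that the earliest offset belongs to ts_list and the latest to fh_list, as in the cases below.

```python
ts_list = [-2, 0]  # inputs at t-2 and t
fh_list = [3, 5]   # targets at t+3 and t+5

# Earliest offset in the combined window; it maps to the window's first row
base = min(ts_list + fh_list)

def rows_for_sample(i, offsets):
    """Row indices of `offsets` for the sample whose window starts at row i."""
    return [i + (k - base) for k in offsets]

print(rows_for_sample(0, ts_list))  # [0, 2]
print(rows_for_sample(0, fh_list))  # [5, 7]
print(rows_for_sample(2, fh_list))  # [7, 9]
```

So the first sample uses rows 0 and 2 as inputs and rows 5 and 7 as targets, and each later sample shifts all of these by one row.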


Case 1: Forecasting time t+1 utilizing information from time t

This is a typical setup, akin to an AR(1) model. The same result can be obtained with ts_list = [-1] and fh_list = [0], since the lag structure is identical.

# Suppose data has 10 observations with 3 features
data = np.random.rand(10, 3)
 
# Generate suitable sequences for CNN, RNN, LSTM, and so on
ts_list = [0]  # selected timesteps
fh_list = [1]  # selected forecasting horizons
 
# generate sequences dataset for Keras
X, Y = f_make_seq_data_from_matrix(data, ts_list, fh_list)
 
print("\nTimesteps:", ts_list, ", forecast horizons:", fh_list)
print("\nData\n", data, "\n Shape of data:", data.shape)
print("\nX\n", X, "\n Shape of X:", X.shape)
print("\nY\n", Y, "\n Shape of Y:", Y.shape)
 


Timesteps: [0] , forecast horizons: [1]
 
Data
 [[0.58708034 0.88707951 0.25878656]
 [0.52696273 0.13857786 0.50993527]
 [0.53533872 0.45365456 0.89658186]
 [0.54978604 0.91198371 0.25040483]
 [0.36520302 0.76098129 0.5341683 ]
 [0.46726791 0.82170191 0.52046577]
 [0.84807446 0.70375552 0.31805087]
 [0.3812772  0.31083093 0.33218005]
 [0.49522332 0.4586895  0.61974004]
 [0.88130502 0.47469752 0.50149153]] 
 Shape of data: (10, 3)
 
X
 [[[0.58708034 0.88707951 0.25878656]]
 
 [[0.52696273 0.13857786 0.50993527]]
 
 [[0.53533872 0.45365456 0.89658186]]
 
 [[0.54978604 0.91198371 0.25040483]]
 
 [[0.36520302 0.76098129 0.5341683 ]]
 
 [[0.46726791 0.82170191 0.52046577]]
 
 [[0.84807446 0.70375552 0.31805087]]
 
 [[0.3812772  0.31083093 0.33218005]]
 
 [[0.49522332 0.4586895  0.61974004]]] 
 Shape of X: (9, 1, 3)
 
Y
 [[[0.52696273 0.13857786 0.50993527]]
 
 [[0.53533872 0.45365456 0.89658186]]
 
 [[0.54978604 0.91198371 0.25040483]]
 
 [[0.36520302 0.76098129 0.5341683 ]]
 
 [[0.46726791 0.82170191 0.52046577]]
 
 [[0.84807446 0.70375552 0.31805087]]
 
 [[0.3812772  0.31083093 0.33218005]]
 
 [[0.49522332 0.4586895  0.61974004]]
 
 [[0.88130502 0.47469752 0.50149153]]] 
 Shape of Y: (9, 1, 3)
 
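
Because only the gap between input and target matters here, Case 1 amounts to pairing each row with the next one. A minimal sketch with plain slicing, using deterministic illustrative data rather than the random matrix above:

```python
import numpy as np

data = np.arange(30, dtype=float).reshape(10, 3)  # illustrative data

# Inputs are rows 0..8, targets are rows 1..9; np.newaxis adds the
# length-1 timesteps axis so that both arrays are 3D
X = data[:-1, np.newaxis, :]
Y = data[1:, np.newaxis, :]

print(X.shape, Y.shape)               # (9, 1, 3) (9, 1, 3)
print(np.array_equal(X[1:], Y[:-1]))  # True: each target is the next input
```

This shortcut only covers the one-lag, one-step case; the general function is still needed for longer or nonconsecutive windows.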


Case 2: Forecasting times t+1, t+2, and t+3, utilizing sequential information from times t, t-1, and t-2

This involves a multistep forecasting approach utilizing sequential past information.

# Generate suitable sequences for CNN, RNN, LSTM, and so on
ts_list = [-2,-1,0] # selected timesteps
fh_list = [1,2,3]   # selected forecasting horizons
 
# generate sequences dataset for Keras
X, Y = f_make_seq_data_from_matrix(data, ts_list, fh_list)
 
print("\nTimesteps:", ts_list, ", forecast horizons:", fh_list)
print("\nData\n", data, "\n Shape of data:", data.shape)
print("\nX\n", X, "\n Shape of X:", X.shape)
print("\nY\n", Y, "\n Shape of Y:", Y.shape)
 


Timesteps: [-2, -1, 0] , forecast horizons: [1, 2, 3]
 
Data
 [[0.58708034 0.88707951 0.25878656]
 [0.52696273 0.13857786 0.50993527]
 [0.53533872 0.45365456 0.89658186]
 [0.54978604 0.91198371 0.25040483]
 [0.36520302 0.76098129 0.5341683 ]
 [0.46726791 0.82170191 0.52046577]
 [0.84807446 0.70375552 0.31805087]
 [0.3812772  0.31083093 0.33218005]
 [0.49522332 0.4586895  0.61974004]
 [0.88130502 0.47469752 0.50149153]] 
 Shape of data: (10, 3)
 
X
 [[[0.58708034 0.88707951 0.25878656]
  [0.52696273 0.13857786 0.50993527]
  [0.53533872 0.45365456 0.89658186]]
 
 [[0.52696273 0.13857786 0.50993527]
  [0.53533872 0.45365456 0.89658186]
  [0.54978604 0.91198371 0.25040483]]
 
 [[0.53533872 0.45365456 0.89658186]
  [0.54978604 0.91198371 0.25040483]
  [0.36520302 0.76098129 0.5341683 ]]
 
 [[0.54978604 0.91198371 0.25040483]
  [0.36520302 0.76098129 0.5341683 ]
  [0.46726791 0.82170191 0.52046577]]
 
 [[0.36520302 0.76098129 0.5341683 ]
  [0.46726791 0.82170191 0.52046577]
  [0.84807446 0.70375552 0.31805087]]] 
 Shape of X: (5, 3, 3)
 
Y
 [[[0.54978604 0.91198371 0.25040483]
  [0.36520302 0.76098129 0.5341683 ]
  [0.46726791 0.82170191 0.52046577]]
 
 [[0.36520302 0.76098129 0.5341683 ]
  [0.46726791 0.82170191 0.52046577]
  [0.84807446 0.70375552 0.31805087]]
 
 [[0.46726791 0.82170191 0.52046577]
  [0.84807446 0.70375552 0.31805087]
  [0.3812772  0.31083093 0.33218005]]
 
 [[0.84807446 0.70375552 0.31805087]
  [0.3812772  0.31083093 0.33218005]
  [0.49522332 0.4586895  0.61974004]]
 
 [[0.3812772  0.31083093 0.33218005]
  [0.49522332 0.4586895  0.61974004]
  [0.88130502 0.47469752 0.50149153]]] 
 Shape of Y: (5, 3, 3)
 
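
The sample count in Case 2 follows from the window length: the offsets span t-2 to t+3, which is 6 rows, so 10 observations yield 10 - 6 + 1 = 5 samples. A small sketch of that arithmetic (the names `n_obs`, `window`, and `n_samples` are illustrative):

```python
ts_list = [-2, -1, 0]
fh_list = [1, 2, 3]
n_obs = 10  # number of rows in the data

# Window length: from the earliest offset to the latest, inclusive
window = max(ts_list + fh_list) - min(ts_list + fh_list) + 1

# Number of complete windows that fit in n_obs rows
n_samples = n_obs - window + 1

print(window)     # 6
print(n_samples)  # 5
```

The same calculation explains the 9 samples in Case 1, where the window spans only 2 rows.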


Case 3: Forecasting nonconsecutive times t+3 and t+5, utilizing distributed lag information from times t and t-2

This exercise isn't realistic; however, it's used to demonstrate the generalized characteristics of the function.

# Generate suitable sequences for CNN, RNN, LSTM, and so on
ts_list = [-2,0]  # selected timesteps
fh_list = [3,5]   # selected forecasting horizons
 
# generate sequences dataset for Keras
X, Y = f_make_seq_data_from_matrix(data, ts_list, fh_list)
 
print("\nTimesteps:", ts_list, ", forecast horizons:", fh_list)
print("\nData\n", data, "\n Shape of data:", data.shape)
print("\nX\n", X, "\n Shape of X:", X.shape)
print("\nY\n", Y, "\n Shape of Y:", Y.shape)
 


Timesteps: [-2, 0] , forecast horizons: [3, 5]
 
Data
 [[0.58708034 0.88707951 0.25878656]
 [0.52696273 0.13857786 0.50993527]
 [0.53533872 0.45365456 0.89658186]
 [0.54978604 0.91198371 0.25040483]
 [0.36520302 0.76098129 0.5341683 ]
 [0.46726791 0.82170191 0.52046577]
 [0.84807446 0.70375552 0.31805087]
 [0.3812772  0.31083093 0.33218005]
 [0.49522332 0.4586895  0.61974004]
 [0.88130502 0.47469752 0.50149153]] 
 Shape of data: (10, 3)
 
X
 [[[0.58708034 0.88707951 0.25878656]
  [0.53533872 0.45365456 0.89658186]]
 
 [[0.52696273 0.13857786 0.50993527]
  [0.54978604 0.91198371 0.25040483]]
 
 [[0.53533872 0.45365456 0.89658186]
  [0.36520302 0.76098129 0.5341683 ]]] 
 Shape of X: (3, 2, 3)
 
Y
 [[[0.46726791 0.82170191 0.52046577]
  [0.3812772  0.31083093 0.33218005]]
 
 [[0.84807446 0.70375552 0.31805087]
  [0.49522332 0.4586895  0.61974004]]
 
 [[0.3812772  0.31083093 0.33218005]
  [0.88130502 0.47469752 0.50149153]]] 
 Shape of Y: (3, 2, 3)
 
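
For comparison, the nonconsecutive selection in Case 3 can also be sketched without a Python loop, using NumPy integer-array indexing. This is an alternative illustration under my own naming (`base`, `window`, `starts`), not the function above, and it uses deterministic data so the selected rows are easy to check.

```python
import numpy as np

data = np.arange(30, dtype=float).reshape(10, 3)  # illustrative data
ts_list = [-2, 0]
fh_list = [3, 5]

base = min(ts_list + fh_list)                   # earliest offset
window = max(ts_list + fh_list) - base + 1      # rows spanned by one sample
starts = np.arange(data.shape[0] - window + 1)  # first row of each window

# Broadcasting (n_samples, 1) + (n_offsets,) gives one row index per
# (sample, offset) pair; indexing data with it returns a 3D array
X = data[starts[:, None] + (np.array(ts_list) - base)]
Y = data[starts[:, None] + (np.array(fh_list) - base)]

print(X.shape, Y.shape)  # (3, 2, 3) (3, 2, 3)
```

The first sample picks rows 0 and 2 for X and rows 5 and 7 for Y, which matches the pattern in the printed output above.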



