Senthilkumar Gopal

Musings of a machine learning researcher, engineer and leader

Feature data creation for Time Series


Time series data is a sequence of observations recorded at a constant interval. This post gives a quick review of how to convert such a sequence into features and labels for training an ML model that predicts the next observation.

Time series feature data extraction

For a time series, each feature set is a run of consecutive values from the list, and the label is the value that immediately follows them. The number of observations taken together is called the window size: we slice a window out of the data and train an ML model to predict the window's last observation from the ones before it. Even a series of only 10 observations can be expanded into several training examples by sliding the window along the list, where the shift determines how far the window moves on each iteration. With a window size of 5 and a shift of 1, 10 observations yield 6 windows. Each window is then split into features and a label, with the last item of the window being the label for the items before it. We can also shuffle and batch the data using the PyTorch DataLoader.
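Before bringing in PyTorch, the windowing step on its own can be sketched in plain Python. This is a minimal illustration rather than part of the original pipeline; the helper name make_windows and its parameters are hypothetical and chosen only for this example.

```python
def make_windows(series, window_size, shift=1):
    """Split a sequence into (features, label) pairs using a sliding window."""
    pairs = []
    # Stop early enough that every window is full (the remainder is dropped)
    for start in range(0, len(series) - window_size + 1, shift):
        window = series[start:start + window_size]
        # All but the last value are the features; the last value is the label
        pairs.append((window[:-1], window[-1]))
    return pairs


pairs = make_windows(list(range(10)), window_size=5, shift=1)
for features, label in pairs:
    print(features, "->", label)
```

The first pair is [0, 1, 2, 3] -> 4 and the last is [5, 6, 7, 8] -> 9, for 6 pairs in total. The PyTorch code that follows applies the same windowing to a tensor and then adds shuffling and batching.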

import torch
from torch.utils.data import TensorDataset, DataLoader

# Generate a PyTorch tensor with numbers 0 to 9
data = torch.arange(10)

# Define window size and shift
window_size = 5
shift = 1

# Window the data and drop remainder
windows = [data[i:i + window_size] for i in range(0, len(data) - window_size + 1, shift)]

# Flatten the windows
flat_windows = [window.flatten() for window in windows]

# Create tuples with features (first four elements of the window) and labels (last element)
features = [window[:-1] for window in flat_windows]
labels = [window[-1] for window in flat_windows]

# Stack the per-window features and the scalar labels into tensors
features_tensor = torch.stack(features)
labels_tensor = torch.stack(labels)

# Create a PyTorch dataset
dataset = TensorDataset(features_tensor, labels_tensor)

# Create a PyTorch DataLoader; shuffle=True reshuffles the samples every epoch,
# so the dataset does not need to be shuffled by hand first
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Print the results
for x, y in dataloader:
    print("x = ", x.numpy())
    print("y = ", y.numpy())
    print()

Example output for reference (the order of the batches varies between runs because the data is shuffled):

x =  [[5 6 7 8]
 [0 1 2 3]]
y =  [9 4]

x =  [[1 2 3 4]
 [2 3 4 5]]
y =  [5 6]

x =  [[4 5 6 7]
 [3 4 5 6]]
y =  [8 7]
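
As an alternative to the list comprehension, the same windows can be built in a single call with the tensor method unfold, which takes a dimension, a window size and a step. The sketch below assumes the same window size and shift as above; the variable names are chosen for this example.

```python
import torch

data = torch.arange(10)
window_size = 5
shift = 1

# unfold(dimension, size, step) returns every full window along the dimension
windows = data.unfold(0, window_size, shift)  # shape: (6, 5)

# All but the last column are the features; the last column holds the labels
features = windows[:, :-1]
labels = windows[:, -1]

print(features)
print(labels)
```

Here features has shape (6, 4) and labels has shape (6,), so both can be passed straight to TensorDataset without stacking a list of tensors.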