Mastering Prompt Engineering for Time Series Analysis

Designing Reliable Prompts for Analyzing Time Series Data

Sadrach Pierre, Ph.D.
DataFabrica

Image by Aphiwat chuangchoem on Pexels

Time series analysis is the task of analyzing data that are ordered by a timestamp or date. It can include generating visualizations, performing seasonal decomposition, building time series forecasting models and more. Time series analysis has a wide range of applications across disparate industries including retail, finance and healthcare. For example, in retail, time series analysis is used for consumer product demand prediction, promotion planning, price optimization and more.

Time series analysis is useful since it can help companies identify underlying trends and patterns in their data. This can help companies gain actionable insights into which factors are driving demand and revenue. For example, knowing which regions have higher demand for a specific set of products can help companies manage inventory in those regions. Further, certain products have seasonal demand and understanding these seasonal patterns can help companies decide when to promote, discount, and manage inventory. All of these applications can help companies optimize revenue, retain customers, and fuel growth.

The rise of Generative AI has led to a wide range of applications for technical tasks such as time series analysis. With Generative AI tools, like ChatGPT, companies can bootstrap their technical approaches to solving business problems with time series analysis. This includes understanding and exploring options for visualizing time series data, performing seasonality and trend analysis and time series forecasting.

In this post, we will develop a series of engineered prompts that can help bootstrap our approach to time series analysis. For our purposes we will be working with the Synthetic Credit Card Transaction data available on DataFabrica. The data contains synthetic credit card transaction amounts, credit card information, transaction IDs and more. The free tier is free to download, modify, and share under the Apache 2.0 license.

A complete list of engineered prompts, outputs and an accompanying Jupyter notebook, containing many more examples than what is discussed in this post, can be found here.

Getting Started

To start, let’s navigate to an IDE of your choice. I personally use Jupyter Notebooks, but feel free to choose whichever IDE you are comfortable with. We will proceed by reading our credit card transaction data into a Pandas dataframe and displaying the first five rows of data:

import pandas as pd

df = pd.read_csv("synthetic_transaction_data_Dining_SMALL_w_items.csv")
print(df.head())
Screenshot taken by Author

We see in our data that we have transaction_date, merchant_name and transaction_amount. We will be using these three columns to generate our time series data. Let’s filter the dataframe to include these three columns and store the result in a new variable called ts_df:

# .copy() gives us an independent dataframe, so later column assignments do not warn
ts_df = df[['transaction_date', 'transaction_amount', 'merchant_name']].copy()

Next we will convert the transaction_date to a Pandas datetime:

ts_df['transaction_date'] = pd.to_datetime(ts_df['transaction_date'])
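As a side note, the same conversion can happen while the file is being read. Here is a minimal sketch, assuming a small in-memory CSV with made-up rows in place of the real file (only the column names come from this post; sample_df and csv_text are hypothetical names):

```python
import io

import pandas as pd

# Made-up rows standing in for the real CSV; only the column names match the post
csv_text = """transaction_date,merchant_name,transaction_amount
2022-01-03,Dunkin',4.50
2022-02-11,Dunkin',3.75
"""

# parse_dates converts the listed columns to datetime64 while reading
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["transaction_date"])
print(sample_df.dtypes)
```

With the real file you would pass the file name instead of the StringIO object.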

Next, let's display the first five rows of our ts_df dataframe:

ts_df.head()
Screenshot taken by Author

Next, let's consider a single merchant for simplicity. Let's filter our data to only include sales for Dunkin' and group by month_year, a column we will engineer from transaction_date:

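# Keep only Dunkin' transactions; .copy() can avoid pandas' SettingWithCopyWarning
ts_df = ts_df[ts_df['merchant_name'] == "Dunkin'"].copy()

# Extract the month (these range checks always pass for valid dates)
ts_df['month'] = ts_df['transaction_date'].dt.month
ts_df = ts_df[ts_df['month'] <= 12]
ts_df = ts_df[ts_df['month'] >= 1].copy()
ts_df['month'] = ts_df['month'].astype(str)

# Extract the year
ts_df['year'] = ts_df['transaction_date'].dt.year
ts_df['year'] = ts_df['year'].astype(str)

# Build a 'YYYY-M' string and parse it back to a datetime (first day of the month)
ts_df['month_year'] = ts_df['year'] + "-" + ts_df['month']
ts_df['month_year'] = pd.to_datetime(ts_df['month_year'])

# Total transaction amount per month
ts_df = ts_df.groupby('month_year')['transaction_amount'].sum().reset_index()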
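For comparison, the same monthly aggregation can be sketched more compactly by truncating each date to its month with a pandas period. This is only an alternative illustration on made-up rows (the toy dataframe and its values are hypothetical), not the code used in the rest of this post:

```python
import pandas as pd

# Hypothetical toy data standing in for the real transactions
toy = pd.DataFrame({
    "transaction_date": pd.to_datetime(
        ["2022-01-03", "2022-01-20", "2022-02-11", "2022-02-28", "2022-03-05"]
    ),
    "merchant_name": ["Dunkin'", "Dunkin'", "Dunkin'", "Starbucks", "Dunkin'"],
    "transaction_amount": [4.50, 6.25, 3.75, 5.00, 8.10],
})

# Keep a single merchant
dunkin = toy[toy["merchant_name"] == "Dunkin'"].copy()

# Truncate each date to the first day of its month
dunkin["month_year"] = dunkin["transaction_date"].dt.to_period("M").dt.to_timestamp()

# Total transaction amount per month
monthly = dunkin.groupby("month_year")["transaction_amount"].sum().reset_index()
print(monthly)
```

The result has one row per month with month_year as a datetime column, which is the same shape as the ts_df we built above.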

Now we’ve prepared our time series data for analysis. Next we will navigate to chatgpt.openai.com and log in. If you don’t already have an account you can easily sign up using your Gmail account:

Screenshot taken by Author

Upon logging in you should see the following chat window:

Screenshot taken by Author

Next we will walk through how to bootstrap time series data visualization using ChatGPT prompt engineering.

Time Series Visualization

Depending on your comfort level with data science and Python, the types of tasks, and consequently the complexity of the prompts, can vary significantly. For each time series analysis task, I will iteratively increase the complexity of the prompt and task to illustrate effective prompt engineering.

Let’s start by asking ChatGPT to generate a time series line plot:

Visualization Prompt 1:

Write python code that generates a line plot given a dataframe with 
month_year and transaction_amount.

Output:

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
data = {'month_year': ['2023-01', '2023-02', '2023-03', '2023-04'],
'transaction_amount': [100, 150, 120, 200]}

df = pd.DataFrame(data)

# Assuming 'month_year' is in datetime format
# If not, you can convert it using df['month_year'] = pd.to_datetime(df['month_year'])

# Sorting the DataFrame by month_year
df['month_year'] = pd.to_datetime(df['month_year'])
df = df.sort_values(by='month_year')

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(df['month_year'], df['transaction_amount'], marker='o', linestyle='-', color='b')

# Adding labels and title
plt.xlabel('Month-Year')
plt.ylabel('Transaction Amount')
plt.title('Transaction Amount Over Time')

# Rotating x-axis labels for better readability
plt.xticks(rotation=45)

# Display the plot
plt.show()

Notice that the code uses a variable called df, which ChatGPT built from sample data it made up. We want this code to reference our dataframe, ts_df. We can update our prompt as follows:

Visualization Prompt 2:

Write python code that generates a line plot given a previously defined
dataframe called ts_df with month_year and transaction_amount.

Output:

import matplotlib.pyplot as plt

# Assuming 'month_year' is in datetime format
# If not, you can convert it using ts_df['month_year'] = pd.to_datetime(ts_df['month_year'])

# Sorting the DataFrame by month_year
ts_df = ts_df.sort_values(by='month_year')

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(ts_df['month_year'], ts_df['transaction_amount'], marker='o', linestyle='-', color='b')

# Adding labels and title
plt.xlabel('Month-Year')
plt.ylabel('Transaction Amount')
plt.title('Transaction Amount Over Time')

# Rotating x-axis labels for better readability
plt.xticks(rotation=45)

# Display the plot
plt.show()

When run, this results in the following plot:

Screenshot taken by Author

In my experience, ChatGPT creates plots with Matplotlib by default unless you request a visualization that only another library supports or you name the library directly. For example, we can specify that we want to generate a line plot using the Plotly library.

Visualization Prompt 3:

Write python code that generates a line plot, using Plotly, given a previously defined
dataframe called ts_df with month_year and transaction_amount.

Output:

import plotly.express as px
import pandas as pd

# Assuming 'month_year' is in datetime format
# If not, you can convert it using ts_df['month_year'] = pd.to_datetime(ts_df['month_year'])

# Sorting the DataFrame by month_year
ts_df = ts_df.sort_values(by='month_year')

# Plotting with Plotly Express
fig = px.line(ts_df, x='month_year', y='transaction_amount', markers=True, line_shape='linear', title='Transaction Amount Over Time')
fig.update_layout(xaxis_title='Month-Year', yaxis_title='Transaction Amount')

# Show the plot
fig.show()
Screenshot taken by Author

Personally, I am a fan of Plotly because it lets you hover over data points in the plot and see specific values.

Key Takeaways:

  1. Specify variables: If the variable name is unspecified, ChatGPT will either define a variable itself or reference variable names that are not defined in your code.
  2. Specify the columns: Similar to variables, if the columns that you would like to perform some logic on aren't specified, ChatGPT will typically reference generic column names that are probably not present in your data.
  3. Specify libraries: In our final prompt example, we specified the library Plotly, but if you prefer Seaborn or any other library you should be explicit in your prompt. Otherwise, ChatGPT will likely fall back on a default library.
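These three takeaways can be folded into a small prompt template. The helper below is a hypothetical sketch (build_plot_prompt is not part of any library); it simply assembles a prompt string that names the dataframe, its columns and, optionally, the plotting library:

```python
def build_plot_prompt(df_name, columns, library=None, plot_type="line plot"):
    """Assemble a plotting prompt that names the dataframe, its columns and a library."""
    column_text = " and ".join(columns)
    # Only mention a library when one was requested
    library_text = f", using {library}," if library else ""
    return (
        f"Write python code that generates a {plot_type}{library_text} "
        f"given a previously defined dataframe called {df_name} "
        f"with {column_text}."
    )


prompt = build_plot_prompt("ts_df", ["month_year", "transaction_amount"], library="Plotly")
print(prompt)
```

Keeping the variable, columns and library as explicit arguments makes it harder to forget one of them when you iterate on a prompt.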

Time Series Forecasting

ChatGPT prompt engineering can also be very helpful for building time series forecasting models. Before we generate prompts, let's filter out the data points on or after Jan 1, 2023:

ts_df = ts_df[ts_df['month_year'] < '2023-01-01']
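To make the cutoff explicit, here is a minimal sketch of the same kind of filter on made-up monthly rows (the toy values are hypothetical), comparing against a pandas Timestamp. Because the comparison is strict, the row dated January 1, 2023 is excluded as well:

```python
import pandas as pd

# Hypothetical monthly totals around the cutoff date
toy = pd.DataFrame({
    "month_year": pd.to_datetime(["2022-11-01", "2022-12-01", "2023-01-01", "2023-02-01"]),
    "transaction_amount": [120.0, 150.0, 90.0, 110.0],
})

# Keep only rows strictly before the cutoff
cutoff = pd.Timestamp("2023-01-01")
before_cutoff = toy[toy["month_year"] < cutoff]
print(before_cutoff)
```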

Now our plot is as follows:

Screenshot taken by Author

Now let's write a prompt that asks ChatGPT to generate an auto-ARIMA time series forecasting model given the dataframe ts_df. We will specify that the data contains columns called month_year and transaction_amount:

Time Series Forecasting Prompt 1:

Write python code that builds an auto ARIMA time series model given 
the dataframe ts_df, which contains columns month_year and transaction_amount.
Make sure to split the data for training and testing.

Output:

import pandas as pd
from pmdarima import auto_arima
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Assuming 'month_year' is in datetime format
# If not, you can convert it using ts_df['month_year'] = pd.to_datetime(ts_df['month_year'])

# Sorting the DataFrame by 'month_year'
ts_df = ts_df.sort_values(by='month_year')

# Creating a time series with 'month_year' as the index
ts_series = pd.Series(ts_df['transaction_amount'].values, index=ts_df['month_year'])

# Splitting the data into training and testing sets
train_size = int(len(ts_series) * 0.8) # Adjust the split ratio as needed
train, test = ts_series[:train_size], ts_series[train_size:]

# Using auto_arima to find the best ARIMA model on the training set
autoarima_model = auto_arima(train, seasonal=True, stepwise=True, suppress_warnings=True, trace=True)

# Displaying the summary of the best model
print(autoarima_model.summary())

# Making predictions on the test set
forecast_steps = len(test)
forecast, conf_int = autoarima_model.predict(n_periods=forecast_steps, return_conf_int=True)

# Evaluating the model on the test set
mse = mean_squared_error(test, forecast)
print(f'Mean Squared Error on Test Set: {mse}')

# Plotting the actual vs. predicted values
plt.figure(figsize=(10, 6))
plt.plot(train.index, train.values, label='Training Set')
plt.plot(test.index, test.values, label='Test Set', color='blue')
plt.plot(test.index, forecast, label='Predictions', linestyle='--', color='orange')
plt.fill_between(test.index, conf_int[:, 0], conf_int[:, 1], color='orange', alpha=0.2, label='Confidence Intervals')
plt.xlabel('Month-Year')
plt.ylabel('Transaction Amount')
plt.title('Auto-ARIMA Model Predictions')
plt.legend()
plt.show()

This generates the following output:

Screenshot taken by Author

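We can improve this prompt in many ways. For example, we can specify another, possibly more advanced, time series model, a different performance metric, or a specific visualization library. Note also that the generated code calls auto_arima with seasonal=True but leaves the m parameter at its default of 1, which pmdarima treats as a non-seasonal fit; for monthly data with a yearly cycle, asking for m=12 in the prompt is a natural refinement, provided the series covers enough years. Part of a typical time series analysis workflow includes comparing different models and their performances.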
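As an illustration of what another performance metric could look like, here is a minimal NumPy sketch that computes MSE, RMSE, MAE and MAPE on made-up actual and predicted values (the numbers are hypothetical and do not come from our models):

```python
import numpy as np

# Made-up actuals and forecasts for a four-month test window
actual = np.array([100.0, 120.0, 80.0, 100.0])
predicted = np.array([110.0, 100.0, 90.0, 100.0])

errors = actual - predicted

mse = np.mean(errors ** 2)                     # mean squared error
rmse = np.sqrt(mse)                            # same units as the data
mae = np.mean(np.abs(errors))                  # mean absolute error
mape = np.mean(np.abs(errors / actual)) * 100  # percentage error; needs non-zero actuals

print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}, MAE: {mae:.2f}, MAPE: {mape:.2f}%")
```

RMSE and MAE are often easier to interpret than MSE because they are in the same units as the transaction amounts.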

Let’s update our prompt to build and compare ARIMA and SARIMAX time series models.

Time Series Forecasting Prompt 2:

Write python code that builds two time series models and compares performance. 
Build an auto ARIMA time series model and a SARIMAX model
given the dataframe ts_df, which contains columns month_year and
transaction_amount. Make sure to split the data for training and testing.
Visualize training, testing and predictions for each model.

Output:

import pandas as pd
import numpy as np
from pmdarima import auto_arima
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Assuming 'month_year' is in datetime format
# If not, you can convert it using ts_df['month_year'] = pd.to_datetime(ts_df['month_year'])


# Sorting the DataFrame by 'month_year'
ts_df = ts_df.sort_values(by='month_year')

# Creating a time series with 'month_year' as the index
ts_series = pd.Series(ts_df['transaction_amount'].values, index=ts_df['month_year'])

# Splitting the data into training and testing sets
train_size = int(len(ts_series) * 0.8) # Adjust the split ratio as needed
train, test = ts_series[:train_size], ts_series[train_size:]

# Auto-ARIMA Model
autoarima_model = auto_arima(train, seasonal=True, stepwise=True, suppress_warnings=True, trace=True)
autoarima_forecast, autoarima_conf_int = autoarima_model.predict(n_periods=len(test), return_conf_int=True)

# SARIMAX Model
order = (1, 1, 1) # Adjust the order based on your analysis
seasonal_order = (1, 1, 1, 12) # Adjust the seasonal order based on your analysis
sarimax_model = SARIMAX(train, order=order, seasonal_order=seasonal_order)
sarimax_results = sarimax_model.fit(disp=False)
sarimax_forecast = sarimax_results.get_forecast(steps=len(test))
sarimax_conf_int = sarimax_forecast.conf_int()

# Evaluating Models
autoarima_mse = mean_squared_error(test, autoarima_forecast)
print(f'Auto-ARIMA Model Mean Squared Error on Test Set: {autoarima_mse}')

sarimax_mse = mean_squared_error(test, sarimax_forecast.predicted_mean)
print(f'SARIMAX Model Mean Squared Error on Test Set: {sarimax_mse}')

# Plotting Results
plt.figure(figsize=(12, 6))

# Auto-ARIMA Plot
plt.subplot(2, 1, 1)
plt.plot(train.index, train.values, label='Training Set')
plt.plot(test.index, test.values, label='Test Set', color='blue')
plt.plot(test.index, autoarima_forecast, label='Auto-ARIMA Predictions', linestyle='--', color='orange')
plt.fill_between(test.index, autoarima_conf_int[:, 0], autoarima_conf_int[:, 1], color='orange', alpha=0.2, label='Confidence Intervals')
plt.title('Auto-ARIMA Model Predictions')
plt.legend()

# SARIMAX Plot
plt.subplot(2, 1, 2)
plt.plot(train.index, train.values, label='Training Set')
plt.plot(test.index, test.values, label='Test Set', color='blue')
plt.plot(test.index, sarimax_forecast.predicted_mean, label='SARIMAX Predictions', linestyle='--', color='green')
plt.fill_between(test.index, sarimax_conf_int.iloc[:, 0], sarimax_conf_int.iloc[:, 1], color='green', alpha=0.2, label='Confidence Intervals')
plt.title('SARIMAX Model Predictions')
plt.legend()

plt.tight_layout()
plt.show()

This generates the following output:

Screenshot taken by Author

The benefit of a SARIMAX model is that it allows you to explicitly incorporate exogenous variables (external factors) in your time series model. Here we do not include exogenous variables, so our SARIMAX model is essentially a SARIMA model. An interesting prompt would be to ask ChatGPT to engineer a feature from one of the columns in our data and use it as an exogenous variable. For example, you can include merchant_state as an exogenous variable in our SARIMAX model. Once dataframes containing the encoded values for merchant state are defined (i.e., exog_train_merchant_state and exog_test_merchant_state), you can pass them as exogenous variables using the following logic:

sarimax_model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12), exog=exog_train_merchant_state)
sarimax_results = sarimax_model.fit(disp=False)
sarimax_forecast = sarimax_results.get_forecast(steps=len(test), exog=exog_test_merchant_state) # Use exogenous variables for forecasting
sarimax_conf_int = sarimax_forecast.conf_int()
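The snippet above assumes the two exogenous dataframes already exist. One possible way to build them, sketched here on made-up rows, is to encode each month by the share of its transactions that occurred in each state and then split the result at the same position as the target series. The toy data and the share-based encoding are my assumptions for illustration, not the only reasonable choice:

```python
import pandas as pd

# Hypothetical transaction-level rows with a month and a state
toy = pd.DataFrame({
    "month_year": pd.to_datetime([
        "2022-01-01", "2022-01-01", "2022-02-01", "2022-02-01",
        "2022-03-01", "2022-04-01", "2022-05-01",
    ]),
    "merchant_state": ["NY", "NJ", "NY", "NY", "NJ", "NY", "NJ"],
    "transaction_amount": [5.0, 3.0, 4.0, 6.0, 2.0, 7.0, 1.0],
})

# Target series: total amount per month
y = toy.groupby("month_year")["transaction_amount"].sum()

# Exogenous features: fraction of each month's transactions in each state
exog = pd.crosstab(toy["month_year"], toy["merchant_state"], normalize="index")

# The shares in each row sum to 1, so one column is redundant and is dropped
exog = exog.drop(columns=exog.columns[0])

# Split target and features at the same chronological position
train_size = int(len(y) * 0.8)
train, test = y.iloc[:train_size], y.iloc[train_size:]
exog_train_merchant_state = exog.iloc[:train_size]
exog_test_merchant_state = exog.iloc[train_size:]

print(exog_train_merchant_state)
```

Keep in mind that forecasting with exogenous variables requires their values for the forecast period, so in practice a feature like this would need to be known or estimated ahead of time.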

Key Takeaways:

  1. Specify variable names and column names: Similar to our data visualization example, make sure to use specific variable names and column names. This will help ensure that the output code isn't referencing variables or columns that are not defined.
  2. Ensure data is split for training and testing: An important step in building a forecasting model is splitting the data for training and testing. This will ensure that the model isn’t trained and validated on the same data. This helps ensure that the model will be able to generalize on unseen data.
  3. Specify different model types for comparison: Another important part of the time series analysis workflow is building, validating and comparing different models. This will help ensure that you pick the best model for your use case.
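To make the last two takeaways concrete, here is a minimal sketch of a chronological split followed by a comparison of two very simple baseline forecasts. The series values are made up, and the baselines (repeat the last training value, repeat the training mean) are stand-ins for real models, chosen only so the example runs with pandas and NumPy:

```python
import numpy as np
import pandas as pd

# Made-up monthly totals
ts_series = pd.Series(
    [100.0, 110.0, 105.0, 120.0, 130.0, 125.0, 140.0, 150.0, 145.0, 160.0],
    index=pd.date_range("2022-01-01", periods=10, freq="MS"),
)

# Chronological split: first 80% for training, last 20% for testing (no shuffling)
train_size = int(len(ts_series) * 0.8)
train, test = ts_series.iloc[:train_size], ts_series.iloc[train_size:]

# Baseline 1: repeat the last training value
naive_forecast = np.full(len(test), train.iloc[-1])
# Baseline 2: repeat the training mean
mean_forecast = np.full(len(test), train.mean())


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return float(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))


# Score every candidate on the same held-out window and pick the lowest error
scores = {"naive": mse(test, naive_forecast), "mean": mse(test, mean_forecast)}
best = min(scores, key=scores.get)
print(scores, best)
```

Scoring every candidate on the same held-out window is what makes the comparison fair, and a simple baseline like these gives the ARIMA and SARIMAX errors something to be measured against.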

Additional Prompts

  1. Visualization Tasks: We can engineer prompts to gain additional insights in our time series data. This includes visualizations through area charts, moving average plots, and seasonal decomposition.
  2. Time series Forecasting: We can also engineer prompts to build, validate and compare more time series forecasting models with our ARIMA and SARIMAX models. This includes building an exponential smoothing model and Long Short-Term Memory (LSTM) models.

These additional prompts, their outputs, and the accompanying Jupyter notebook are available on DataFabrica. You can access the additional prompts and notebook here.
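To give a flavor of two of these tasks, here is a minimal pandas sketch of a moving average and a simple exponential smoothing of a made-up monthly series. The values, the window of 3 and the smoothing factor of 0.5 are arbitrary choices for illustration:

```python
import pandas as pd

# Made-up monthly totals
monthly = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.date_range("2022-01-01", periods=4, freq="MS"),
)

# 3-month moving average; the first two values are NaN because the window is incomplete
moving_avg = monthly.rolling(window=3).mean()

# Simple exponential smoothing: s_t = 0.5 * x_t + 0.5 * s_(t-1), starting at s_0 = x_0
smoothed = monthly.ewm(alpha=0.5, adjust=False).mean()

print(pd.DataFrame({"actual": monthly, "moving_avg": moving_avg, "smoothed": smoothed}))
```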

Conclusions

In this post we discussed how to iteratively improve prompts to help bootstrap time series analysis tasks. First we discussed how to define a prompt to generate visualizations of time series data. We then improved this prompt by adding context that specifies the correct names for variables and columns. Further, we updated the prompt to generate the time series visualization using an alternative library, Plotly, instead of the default choice of Matplotlib. We then considered the task of prompt engineering for generating time series forecasting models. We started by asking ChatGPT to build an ARIMA time series forecasting model. We then updated the prompt to generate code that allows us to build and compare an ARIMA and a SARIMAX model.

While we walked through a few examples of prompt engineering for time series analysis, there are many other techniques we could have covered. For visualization these include moving average analysis, area plots, seasonal decomposition and more. For time series forecasting, we could also have included more models, such as exponential smoothing models and Long Short-Term Memory (LSTM) models. You can explore the prompts for generating code for these types of analyses here.

Sadrach Pierre, Ph.D.
DataFabrica

Writer for Built In & Towards Data Science. Cornell University Ph. D. in Chemical Physics.