PyTimeTK: the library for time series analysis
In this tutorial, we will show how to use the PyTimeTK Python library to analyze time series data.
PyTimeTK, an optimized library
Time series analysis is fundamental to many fields, from business forecasting to scientific research. Although the Python ecosystem offers tools like pandas, they can sometimes be verbose and not optimized for all operations, especially for complex time-based aggregations and visualizations.
PyTimeTK combines ease of use with computational efficiency and greatly simplifies the process of manipulating and visualizing time series. By leveraging the Polars backend, you can benefit from speed improvements ranging from 3X to 3500X.
Prerequisites — Make sure you have Python 3.9 or later installed on your system.
Install the latest stable version of pytimetk using pip (0.2.0 at the time of writing):
pip install pytimetk
Alternatively, you can install the development version:
pip install git+https://github.com/business-science/pytimetk.git
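After installing, it can be useful to confirm that the interpreter meets the Python 3.9 requirement and that the package is visible to it. The following is a minimal sketch using only the standard library; the helper name pytimetk_version is hypothetical and not part of pytimetk.

```python
import sys
from importlib import metadata


def pytimetk_version():
    """Return the installed pytimetk version string, or None if it is not installed."""
    try:
        return metadata.version("pytimetk")
    except metadata.PackageNotFoundError:
        return None


# The tutorial asks for Python 3.9 or later.
python_ok = sys.version_info >= (3, 9)
print("Python 3.9+:", python_ok)
print("pytimetk version:", pytimetk_version() or "not installed")
```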
Data Management / Data Wrangling
summarize_by_time: Summarizes a DataFrame or GroupBy object by time. This function is useful for aggregating a dataset, optionally grouped with groupby, over a temporal frequency. Several aggregation functions are available: sum, median, minimum, maximum, standard deviation, variance, first value in group, last value in group, number of values, number of unique values, and correlation between values.
Custom lambda aggregation functions can also be used. The frequency is given as a string, such as "D" for daily or "MS" for start of month:
- S: second frequency
- min: minute frequency
- H: hourly frequency
- D: daily frequency
- W: weekly frequency
- M: end-of-month frequency
- MS: start-of-month frequency
- Q: end-of-quarter frequency
- QS: start-of-quarter frequency
- Y: end-of-year frequency
- YS: start-of-year frequency
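import numpy as np
import pytimetk as tk  # importing pytimetk adds summarize_by_time to pandas objects

# df is assumed to be a pandas DataFrame with the columns used below
df.groupby('category_1').summarize_by_time(
    date_column = 'order_date',
    value_column = ['total_price', 'quantity'],
    freq = 'MS',  # aggregate to start-of-month periods
    agg_func = [
        'sum', 'mean', 'median', 'min',
        ('q25', lambda x: np.quantile(x, 0.25)),  # custom (name, function) pairs
        ('q75', lambda x: np.quantile(x, 0.75)),
        'max',
        ('range', lambda x: x.max() - x.min()),
    ],
    wide_format = False,
    engine = 'pandas',
)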
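To make the aggregation above more concrete, here is a rough plain-pandas sketch of the same idea: group by a category and a start-of-month bucket, then apply built-in and custom aggregations. It is an illustration on hypothetical toy data, not pytimetk's implementation, and the output column names (total, value_range, and so on) are chosen here for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the example above.
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-10", "2023-02-25"]
    ),
    "category_1": ["Road", "Road", "Road", "Mountain", "Mountain"],
    "total_price": [100.0, 300.0, 200.0, 50.0, 150.0],
})

# Group by category and by start-of-month bucket ("MS"), then aggregate.
monthly = (
    df.groupby(["category_1", pd.Grouper(key="order_date", freq="MS")])["total_price"]
      .agg(
          total="sum",
          mean="mean",
          q25=lambda x: np.quantile(x, 0.25),        # custom quantile
          value_range=lambda x: x.max() - x.min(),   # custom range
      )
      .reset_index()
)
print(monthly)
```

Each output row corresponds to one category and one month, which is the long format that wide_format = False asks for in the pytimetk call.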
pad_by_time: Makes irregular time series regular by inserting rows for the missing timestamps. For example, if you have monthly historical data that is missing a few months, pad_by_time adds a row for each missing month. The values in the added rows are left missing, so you can then fill them, for example with zeros.
df.groupby('category_1').pad_by_time(
    date_column = 'order_date',
    freq = 'W',
    end_date = df.order_date.max(),
)
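The padding step can be pictured with plain pandas: build the complete date range, reindex onto it, and optionally fill the gaps. This is a minimal sketch on hypothetical data for a single series, not the way pytimetk implements pad_by_time.

```python
import pandas as pd

# Hypothetical monthly series in which March is missing.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-04-01"]),
    "total_price": [100.0, 120.0, 90.0],
})

# Every month start between the first and last observation.
full_index = pd.date_range(df["order_date"].min(), df["order_date"].max(), freq="MS")

# Reindexing inserts a row for each missing month, with NaN values.
padded = (
    df.set_index("order_date")
      .reindex(full_index)
      .rename_axis("order_date")
      .reset_index()
)

# Optional follow-up: replace the missing values with zeros.
filled = padded.fillna({"total_price": 0.0})
print(padded)
print(filled)
```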
future_frame: Extends a DataFrame or GroupBy object with future dates for a specified number of periods, optionally binding the new rows to the original data. This is mainly useful for preparing a future dataset that can be used for predictions.
df.groupby('category_2').future_frame(
    date_column = 'order_date',
    length_out = 12,  # Extend each group by 12 future periods (about 3 months for weekly data)
)
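Conceptually, extending a frame into the future means generating the next dates at the series' frequency and appending them with empty values. The sketch below does this in plain pandas for one weekly series; the data is hypothetical and the code is not pytimetk's implementation.

```python
import pandas as pd

# Hypothetical weekly series (weeks ending on Sunday).
df = pd.DataFrame({
    "order_date": pd.date_range("2023-01-01", periods=4, freq="W"),
    "quantity": [3, 5, 2, 4],
})

length_out = 12  # number of future periods to add

# Generate length_out + 1 dates starting at the last observed date,
# then drop the first one so only future dates remain.
future_dates = pd.date_range(
    df["order_date"].max(), periods=length_out + 1, freq="W"
)[1:]

# Future rows have dates but no observed values yet (NaN after concatenation).
future = pd.DataFrame({"order_date": future_dates})
extended = pd.concat([df, future], ignore_index=True)
print(extended.tail())
```

The rows with missing quantity are the ones a forecasting model would later fill in.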
Data visualization
plot_timeseries() — Creates time series plots using different plotting engines such as Plotnine, Matplotlib and Plotly.
- Generates interactive plots (ideal for exploration and for Streamlit/Shiny applications)
- Consolidates over 20 lines of plotnine/matplotlib and plotly code
- Scales well to many time series
- Can be switched from interactive plotly plots to static plotnine/matplotlib plots by changing the engine from plotly to plotnine or matplotlib
- Groups can be added using pandas groupby()
df.groupby('id').plot_timeseries(
    'date', 'value',
    facet_ncol = 2,             # 2-column faceted plot
    facet_scales = "free_y",
    smooth_frac = 0.2,          # Apply smoothing to the time series data
    smooth_size = 2.0,
    y_intercept = None,
    x_axis_date_labels = "%Y",
    engine = 'plotly',
    width = 600,
    height = 500,
)
Anomaly detection
Anomaly detection in time series analysis is a crucial process for identifying unusual patterns that deviate from expected behavior. These anomalies can signify critical, often unanticipated, events in time series data. Effective anomaly detection helps maintain data quality and reliability, ensuring accurate forecasting and decision-making.
The challenge is to distinguish between true anomalies and natural fluctuations, which requires sophisticated analytical techniques and a deep understanding of the underlying time series patterns. Therefore, anomaly detection is an essential part of time series analysis, supporting the proactive management of risks and opportunities in dynamic environments.
PyTimeTK uses the following steps to detect anomalies in time series data:
1- Decomposition of time series:
- The first step consists of decomposing the time series into several components. Typically, this includes trend, seasonality, and remainder (or residual) components.
- The trend represents the underlying pattern or direction of the data over time. Seasonality captures recurring patterns or cycles over a specific time period, e.g. daily, weekly, monthly, etc.
- The remainder (or residual) is what remains after the trend and seasonal components have been removed from the original time series.
2- Generate remainders:
- After decomposition, the remainder component is extracted. This component reflects the part of the time series that cannot be explained by the trend and seasonal components.
- The idea is that while trend and seasonality represent predictable and therefore "normal" patterns, the remainder is where anomalies are most likely to manifest.
There are 2 common seasonal decomposition techniques: STL and Twitter.
- STL (Seasonal and Trend decomposition using Loess) is a versatile and robust method for decomposing time series. STL works very well when a long-term trend is present, because the Loess algorithm generally does a very good job of detecting the trend. However, when the seasonal component is more dominant than the trend, Twitter tends to perform better.
- The Twitter method is a decomposition method similar to the one used in Twitter's AnomalyDetection package. It removes the seasonal component in the same way as STL. The main difference is detrending, which is done by removing the median of the data rather than fitting a smoother. The median works well when the long-term trend is less dominant than the short-term seasonal component, because a smoother tends to overfit the anomalies.
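The decompose-then-inspect-the-remainder idea can be sketched in a few lines of standard-library Python. This is a deliberately crude illustration, not the algorithm pytimetk runs: the seasonal component is estimated as the per-position median within each cycle (rather than with Loess), the trend is a single median as in the Twitter-style description above, and anomalies are flagged with interquartile-range fences. The function name and the iqr_factor parameter are hypothetical and are not the same thing as pytimetk's iqr_alpha.

```python
from statistics import median, quantiles


def twitter_style_anomalies(values, period, iqr_factor=3.0):
    """Flag anomalies via a crude seasonal + median decomposition (illustration only)."""
    overall = median(values)
    # Seasonal component: median of each position in the cycle,
    # expressed relative to the overall median.
    seasonal_by_phase = [
        median(values[phase::period]) - overall for phase in range(period)
    ]
    seasonal = [seasonal_by_phase[i % period] for i in range(len(values))]
    # Twitter-style detrending: subtract one median instead of a smoothed trend.
    deseasonalized = [v - s for v, s in zip(values, seasonal)]
    trend = median(deseasonalized)
    remainder = [d - trend for d in deseasonalized]
    # Interquartile-range fences on the remainder.
    q1, _, q3 = quantiles(remainder, n=4)
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    flags = [r < lower or r > upper for r in remainder]
    return remainder, flags


# Hypothetical data: a weekly pattern, small deterministic noise, one spike.
pattern = [10, 12, 15, 14, 13, 20, 22]
noise = [0.0, 0.4, -0.4, 0.8, -0.8]
series = [pattern[i % 7] + noise[i % 5] for i in range(42)]
series[17] += 40  # inject an anomaly

remainder, flags = twitter_style_anomalies(series, period=7)
print([i for i, flagged in enumerate(flags) if flagged])  # → [17]
```

Only the injected spike falls outside the fences; the regular weekly swings are absorbed by the seasonal component and never reach the remainder.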
anomalize: Detects anomalies in time series data, either for a single time series or for multiple time series grouped by a specific column. This function can use parallel processing to speed up calculation for large data sets with many time series groups. Parallel processing incurs overhead and may not be faster on small data sets. To use parallel processing, set threads = -1 to use all available processors.
import pytimetk as tk

# Anomalize the data
anomalize_df = tk.anomalize(
    df, "date", "value",
    method = "twitter",
    iqr_alpha = 0.10,    # Threshold for detecting outliers
    clean_alpha = 0.75,  # Threshold for cleaning the outliers
    clean = "min_max",   # Method used to clean the anomalies
    verbose = True,      # Display additional info and progress during execution
)
This function returns a data frame in which recomposed_l1 and recomposed_l2 are the lower and upper bounds of the recomposed time series.
plot_anomalies: Creates an anomaly plot for time series data using Plotly or Matplotlib.
anomalize_df.plot_anomalies(date_column = "date", engine = "plotly")
plot_anomalies_decomp: Takes the output of the anomalize() function and returns a plot of the anomaly decomposition, showing the observed, seasonal, trend and remainder components. We can also use groupby to plot it by category.
anomalize_df.plot_anomalies_decomp("date", engine = 'plotly')
plot_anomalies_cleaned: Takes the output of the anomalize() function and returns a plot of the cleaned anomalies, so we can visualize the data before and after the anomalies are removed.
anomalize_df.plot_anomalies_cleaned("date")
Feature Engineering
The following functions from the pytimetk package add features to time series DataFrames (augmentation):
- augment_timeseries_signature: Takes a DataFrame and a date column as input and returns the original DataFrame with 29 date- and time-based features added as new columns, named after the date column.
- augment_holiday_signature: Engineers 4 different holiday features from a single date/time column, for 137 countries.
- augment_lags: Adds lags to a pandas DataFrame or DataFrameGroupBy object.
- augment_leads: Adds leads to a pandas DataFrame or DataFrameGroupBy object.
- augment_diffs: Adds diffs to a pandas DataFrame or DataFrameGroupBy object.
- augment_rolling: Applies one or more Series-based rolling functions and window sizes to one or more columns of a DataFrame.
- augment_rolling_apply: Applies one or more DataFrame-based rolling functions and window sizes to one or more columns of a DataFrame.
- augment_expanding: Applies one or more Series-based expanding functions to one or more columns of a DataFrame.
- augment_expanding_apply: Applies one or more DataFrame-based expanding functions to one or more columns of a DataFrame.
- augment_fourier: Adds Fourier transforms to a pandas DataFrame or DataFrameGroupBy object.
- augment_hilbert: Applies the Hilbert transform to the specified columns of a DataFrame.
- augment_wavelet: Applies the wavelet transform to the specified columns of a DataFrame.
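For readers unfamiliar with these feature types, the sketch below builds a few of them (calendar, lag, lead, diff, rolling and expanding features) with plain pandas on hypothetical data. It only illustrates what such features look like; the column names are chosen for this example and are not the names the augment_* functions generate.

```python
import pandas as pd

# Hypothetical daily series.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=6, freq="D"),
    "value": [10.0, 12.0, 11.0, 15.0, 14.0, 18.0],
})

features = df.assign(
    # Calendar features derived from the date column.
    date_month=df["date"].dt.month,
    date_wday=df["date"].dt.dayofweek,  # Monday = 0, Sunday = 6
    # Lag and lead: the previous and next observation.
    value_lag_1=df["value"].shift(1),
    value_lead_1=df["value"].shift(-1),
    # Difference from the previous observation.
    value_diff_1=df["value"].diff(1),
    # Rolling mean over a 3-observation window.
    value_rolling_mean_3=df["value"].rolling(window=3).mean(),
    # Expanding maximum: the largest value seen so far.
    value_expanding_max=df["value"].expanding().max(),
)
print(features)
```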
Comparison with Pandas
As the table shows, pytimetk is not just about speed; it also simplifies your code base. For example, summarize_by_time() turns a 6-line, double for-loop routine in pandas into a concise 2-line operation. And with the Polars engine, you can get results 13.4 times faster than with pandas!
Likewise, plot_timeseries() greatly streamlines the plotting process, encapsulating what would typically require 16 lines of matplotlib code into a simple 2-line command in pytimetk, without sacrificing customization or quality. And with the plotly and plotnine engines, you can create interactive plots and stunning static visualizations with just a few lines of code.
For calendar features, pytimetk offers augment_timeseries_signature(), which replaces 30+ lines of pandas dt extractions. For rolling features, pytimetk offers augment_rolling(), which is 10 to 3500 times faster than pandas. It also offers pad_by_time() to fill gaps in your time series data, and anomalize() to detect and correct anomalies.