{ "cells": [ { "cell_type": "markdown", "id": "f32100be", "metadata": {}, "source": [ "# Loading Custom Datasets\n", "\n", "This notebook will explain how to load custom datasets saved to CSV files, for either anomaly detection or forecasting." ] }, { "cell_type": "markdown", "id": "91095c9b", "metadata": {}, "source": [ "## Anomaly Detection Datasets\n", "\n", "Let's first look at a synthetic anomaly detection dataset. Note that this section just provides an alternative implementation of the dataset `ts_datasets.anomaly.Synthetic`. We begin by listing all the CSV files in the relevant directory. " ] }, { "cell_type": "code", "execution_count": 1, "id": "b4886d69", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "../data/synthetic_anomaly/horizontal.csv\n", "../data/synthetic_anomaly/horizontal_dip_anomaly.csv\n", "../data/synthetic_anomaly/horizontal_level_anomaly.csv\n", "../data/synthetic_anomaly/horizontal_shock_anomaly.csv\n", "../data/synthetic_anomaly/horizontal_spike_anomaly.csv\n", "../data/synthetic_anomaly/horizontal_trend_anomaly.csv\n", "../data/synthetic_anomaly/seasonal.csv\n", "../data/synthetic_anomaly/seasonal_dip_anomaly.csv\n", "../data/synthetic_anomaly/seasonal_level_anomaly.csv\n", "../data/synthetic_anomaly/seasonal_shock_anomaly.csv\n", "../data/synthetic_anomaly/seasonal_spike_anomaly.csv\n", "../data/synthetic_anomaly/seasonal_trend_anomaly.csv\n", "../data/synthetic_anomaly/upward_downward.csv\n", "../data/synthetic_anomaly/upward_downward_dip_anomaly.csv\n", "../data/synthetic_anomaly/upward_downward_level_anomaly.csv\n", "../data/synthetic_anomaly/upward_downward_shock_anomaly.csv\n", "../data/synthetic_anomaly/upward_downward_spike_anomaly.csv\n", "../data/synthetic_anomaly/upward_downward_trend_anomaly.csv\n" ] } ], "source": [ "import glob\n", "import os\n", "anom_dir = os.path.join(\"..\", \"data\", \"synthetic_anomaly\")\n", "csvs = sorted(glob.glob(f\"{anom_dir}/*.csv\"))\n", "for csv in csvs:\n", " print(csv)" ] }, { "cell_type": "markdown", "id": "9d319673", "metadata": {}, "source": [ "Let's visualize what a couple of these CSVs look like." ] }, { "cell_type": "code", "execution_count": 2, "id": "3151334c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "../data/synthetic_anomaly/horizontal.csv\n" ] }, { "data": { "text/html": [ "
10000 rows × 2 columns
" ], "text/plain": [ " timestamp horizontal\n", "0 0 1.928031\n", "1 300 -1.156620\n", "2 600 -0.390650\n", "3 900 0.400804\n", "4 1200 -0.874490\n", "... ... ...\n", "9995 2998500 0.362724\n", "9996 2998800 2.657373\n", "9997 2999100 1.472341\n", "9998 2999400 1.033154\n", "9999 2999700 2.950466\n", "\n", "[10000 rows x 2 columns]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "../data/synthetic_anomaly/seasonal_level_anomaly.csv\n" ] }, { "data": { "text/html": [ "
10000 rows × 3 columns
" ], "text/plain": [ " timestamp seasonal anomaly\n", "0 0 -0.577883 0.0\n", "1 300 1.059779 0.0\n", "2 600 1.137609 0.0\n", "3 900 0.743360 0.0\n", "4 1200 1.998400 0.0\n", "... ... ... ...\n", "9995 2998500 -5.388685 0.0\n", "9996 2998800 -5.017828 0.0\n", "9997 2999100 -4.196791 0.0\n", "9998 2999400 -4.234555 0.0\n", "9999 2999700 -3.111685 0.0\n", "\n", "[10000 rows x 3 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "from IPython.display import display\n", "\n", "for csv in [csvs[0], csvs[8]]:\n", " print(csv)\n", " display(pd.read_csv(csv))" ] }, { "cell_type": "markdown", "id": "4dd0360b", "metadata": {}, "source": [ "Each CSV in the dataset has the following important characteristics:\n", "\n", "- a time column `timestamp` (here, a Unix timestamp expressed in units of seconds);\n", "- a column `anomaly` indicating whether a timestamp is anomalous or not (though this is absent for time series which don't contain any anomalies);\n", "- one or more columns for the actual data values\n", "\n", "We can create a data loader for all the CSV files in this dataset as follows:" ] }, { "cell_type": "code", "execution_count": 3, "id": "69bbc96d", "metadata": {}, "outputs": [], "source": [ "from ts_datasets.anomaly import CustomAnomalyDataset\n", "dataset = CustomAnomalyDataset(\n", " rootdir=anom_dir, # where the data is stored\n", " test_frac=0.75, # use 75% of each time series for testing. \n", " # overridden if the column `trainval` is in the actual CSV.\n", " time_unit=\"s\", # the timestamp column (automatically detected) is in units of seconds\n", " assume_no_anomaly=True # if a CSV doesn't have the \"anomaly\" column, assume it has no anomalies\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "id": "bc2d0778", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 18 time series in this dataset.\n" ] } ], "source": [ "print(f\"There are {len(dataset)} time series in this dataset.\")\n", "time_series, metadata = dataset[3]" ] }, { "cell_type": "markdown", "id": "9d1f1568", "metadata": {}, "source": [ "This particular time series is univariate. Its variable is named \"horizontal\". " ] }, { "cell_type": "code", "execution_count": 5, "id": "c2a87bf9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
10000 rows × 1 columns
" ], "text/plain": [ " horizontal\n", "timestamp \n", "1970-01-01 00:00:00 1.928031\n", "1970-01-01 00:05:00 -1.156620\n", "1970-01-01 00:10:00 -0.390650\n", "1970-01-01 00:15:00 0.400804\n", "1970-01-01 00:20:00 -0.874490\n", "... ...\n", "1970-02-04 16:55:00 0.362724\n", "1970-02-04 17:00:00 2.657373\n", "1970-02-04 17:05:00 1.472341\n", "1970-02-04 17:10:00 1.033154\n", "1970-02-04 17:15:00 2.950466\n", "\n", "[10000 rows x 1 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(time_series)" ] }, { "cell_type": "markdown", "id": "ec03b3f1", "metadata": {}, "source": [ "The metadata has the same timestamps as the time series. It contains \"anomaly\" and \"trainval\" columns. These respectively indicate whether each timestamp is anomalous, and whether each timestamp is for training/validation or testing." ] }, { "cell_type": "code", "execution_count": 6, "id": "3e5eb1d4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
10000 rows × 2 columns
" ], "text/plain": [ " anomaly trainval\n", "timestamp \n", "1970-01-01 00:00:00 False True\n", "1970-01-01 00:05:00 False True\n", "1970-01-01 00:10:00 False True\n", "1970-01-01 00:15:00 False True\n", "1970-01-01 00:20:00 False True\n", "... ... ...\n", "1970-02-04 16:55:00 False False\n", "1970-02-04 17:00:00 False False\n", "1970-02-04 17:05:00 False False\n", "1970-02-04 17:10:00 False False\n", "1970-02-04 17:15:00 False False\n", "\n", "[10000 rows x 2 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(metadata)" ] }, { "cell_type": "code", "execution_count": 7, "id": "a911fea8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "75.0% of the time series is for testing.\n", "19.57% of the time series is anomalous.\n" ] } ], "source": [ "print(f\"{100 - metadata.trainval.mean() * 100}% of the time series is for testing.\")\n", "print(f\"{metadata.anomaly.mean() * 100}% of the time series is anomalous.\")" ] }, { "cell_type": "markdown", "id": "63a181a3", "metadata": {}, "source": [ "## General Purpose (Forecasting) Datasets\n", "\n", "Next, let's load a more general-purpose dataset for forecasting. We will use this opportunity to show some of the more advanced features as well. Here, our dataset consists of a single CSV file which contains many multivariate time series. These time series are collected from a large retailer, and each individual time series corresonds to a different department within a different store. Let's have a look at the data." ] }, { "cell_type": "code", "execution_count": 8, "id": "2d0809ae", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
2860 rows × 14 columns
" ], "text/plain": [ " Store Dept Date Weekly_Sales Temperature Fuel_Price \\\n", "0 1 1 2010-02-05 24924.50 42.31 2.572 \n", "1 1 1 2010-02-12 46039.49 38.51 2.548 \n", "2 1 1 2010-02-19 41595.55 39.93 2.514 \n", "3 1 1 2010-02-26 19403.54 46.63 2.561 \n", "4 1 1 2010-03-05 21827.90 46.50 2.625 \n", "... ... ... ... ... ... ... \n", "2855 2 10 2012-09-28 37104.67 79.45 3.666 \n", "2856 2 10 2012-10-05 36361.28 70.27 3.617 \n", "2857 2 10 2012-10-12 35332.34 60.97 3.601 \n", "2858 2 10 2012-10-19 35721.09 68.08 3.594 \n", "2859 2 10 2012-10-26 34260.76 69.79 3.506 \n", "\n", " MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI \\\n", "0 NaN NaN NaN NaN NaN 211.096358 \n", "1 NaN NaN NaN NaN NaN 211.242170 \n", "2 NaN NaN NaN NaN NaN 211.289143 \n", "3 NaN NaN NaN NaN NaN 211.319643 \n", "4 NaN NaN NaN NaN NaN 211.350143 \n", "... ... ... ... ... ... ... \n", "2855 7106.05 1.91 1.65 1549.10 3946.03 222.616433 \n", "2856 6037.76 NaN 10.04 3027.37 3853.40 222.815930 \n", "2857 2145.50 NaN 33.31 586.83 10421.01 223.015426 \n", "2858 4461.89 NaN 1.14 1579.67 2642.29 223.059808 \n", "2859 6152.59 129.77 200.00 272.29 2924.15 223.078337 \n", "\n", " Unemployment IsHoliday \n", "0 8.106 False \n", "1 8.106 True \n", "2 8.106 False \n", "3 8.106 False \n", "4 8.106 False \n", "... ... ... \n", "2855 6.565 False \n", "2856 6.170 False \n", "2857 6.170 False \n", "2858 6.170 False \n", "2859 6.170 False \n", "\n", "[2860 rows x 14 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "csv = os.path.join(\"..\", \"data\", \"walmart\", \"walmart_mini.csv\")\n", "display(pd.read_csv(csv))" ] }, { "cell_type": "markdown", "id": "5fde813d", "metadata": {}, "source": [ "As before, we have a column `Date` indicating the time. Note that in this case, we have a string rather than a timestamp; this is also okay. However, we now also have some index columns `Store` and `Dept` which are used to distinguish between different time series. We specify these to the data loader." ] }, { "cell_type": "code", "execution_count": 9, "id": "fe500896", "metadata": {}, "outputs": [], "source": [ "from ts_datasets.forecast import CustomDataset\n", "dataset = CustomDataset(\n", " rootdir=csv, # where the data is stored\n", " index_cols=[\"Store\", \"Dept\"], # Individual time series are indexed by store & department\n", " test_frac=0.75, # use 25% of each time series for testing. \n", " # overridden if the column `trainval` is in the actual CSV.\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "id": "8ca5296f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 20 time series in this dataset.\n" ] } ], "source": [ "print(f\"There are {len(dataset)} time series in this dataset.\")\n", "time_series, metadata = dataset[17]" ] }, { "cell_type": "markdown", "id": "7cfc92a8", "metadata": {}, "source": [ "This particular time series is multivariate." ] }, { "cell_type": "code", "execution_count": 11, "id": "301d9344", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
143 rows × 11 columns
" ], "text/plain": [ " Weekly_Sales Temperature Fuel_Price MarkDown1 MarkDown2 \\\n", "Date \n", "2010-02-05 69634.80 40.19 2.572 NaN NaN \n", "2010-02-12 63393.29 38.49 2.548 NaN NaN \n", "2010-02-19 66589.27 39.69 2.514 NaN NaN \n", "2010-02-26 61875.48 46.10 2.561 NaN NaN \n", "2010-03-05 67041.18 47.17 2.625 NaN NaN \n", "... ... ... ... ... ... \n", "2012-09-28 57424.00 79.45 3.666 7106.05 1.91 \n", "2012-10-05 62955.51 70.27 3.617 6037.76 NaN \n", "2012-10-12 63083.63 60.97 3.601 2145.50 NaN \n", "2012-10-19 60502.97 68.08 3.594 4461.89 NaN \n", "2012-10-26 63992.36 69.79 3.506 6152.59 129.77 \n", "\n", " MarkDown3 MarkDown4 MarkDown5 CPI Unemployment \\\n", "Date \n", "2010-02-05 NaN NaN NaN 210.752605 8.324 \n", "2010-02-12 NaN NaN NaN 210.897994 8.324 \n", "2010-02-19 NaN NaN NaN 210.945160 8.324 \n", "2010-02-26 NaN NaN NaN 210.975957 8.324 \n", "2010-03-05 NaN NaN NaN 211.006754 8.324 \n", "... ... ... ... ... ... \n", "2012-09-28 1.65 1549.10 3946.03 222.616433 6.565 \n", "2012-10-05 10.04 3027.37 3853.40 222.815930 6.170 \n", "2012-10-12 33.31 586.83 10421.01 223.015426 6.170 \n", "2012-10-19 1.14 1579.67 2642.29 223.059808 6.170 \n", "2012-10-26 200.00 272.29 2924.15 223.078337 6.170 \n", "\n", " IsHoliday \n", "Date \n", "2010-02-05 False \n", "2010-02-12 True \n", "2010-02-19 False \n", "2010-02-26 False \n", "2010-03-05 False \n", "... ... \n", "2012-09-28 False \n", "2012-10-05 False \n", "2012-10-12 False \n", "2012-10-19 False \n", "2012-10-26 False \n", "\n", "[143 rows x 11 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(time_series)" ] }, { "cell_type": "markdown", "id": "33926c81", "metadata": {}, "source": [ "The metadata has the same timestamps as the time series. It has a \"trainval\" column as before, plus index columns \"Store\" and \"Dept\"." ] }, { "cell_type": "code", "execution_count": 12, "id": "4d3cd301", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
143 rows × 3 columns
" ], "text/plain": [ " trainval Store Dept\n", "Date \n", "2010-02-05 True 2 8\n", "2010-02-12 True 2 8\n", "2010-02-19 True 2 8\n", "2010-02-26 True 2 8\n", "2010-03-05 True 2 8\n", "... ... ... ...\n", "2012-09-28 False 2 8\n", "2012-10-05 False 2 8\n", "2012-10-12 False 2 8\n", "2012-10-19 False 2 8\n", "2012-10-26 False 2 8\n", "\n", "[143 rows x 3 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(metadata)" ] }, { "cell_type": "markdown", "id": "19562928", "metadata": {}, "source": [ "## Broader Takeaways\n", "\n", "In general, a dataset can contain any number of CSVs stored under a single root directory. Each CSV can contain one or more time series, where the different time series within a single file are indicated by different values of the index column. Note that this works for anomaly detection as well! You just need to make sure that your CSVs all contain the `anomaly` column. In general, all features supported by `CustomDataset` are also supported by `CustomAnomalyDataset`, as long as your CSV files have the `anomaly` column.\n", "\n", "If you want to either of the above custom datasets for benchmarking, you can call\n", "\n", "```\n", "python benchmark_anomaly.py --model IsolationForest --retrain_freq 7d \\\n", " --dataset CustomAnomalyDataset --data_root data/synthetic_anomaly \\\n", " --data_kwargs '{\"assume_no_anomaly\": true, \"test_frac\": 0.75}'\n", "```\n", "\n", "or \n", "\n", "```\n", "python benchmark_forecast.py --model AutoETS \\\n", " --dataset CustomDataset --data_root data/walmart/walmart_mini.csv \\\n", " --data_kwargs '{\"test_frac\": 0.25, \\\n", " \"index_cols\": [\"Store\", \"Dept\"], \\\n", " \"data_cols\": [\"Weekly_Sales\"]}'\n", "```\n", "\n", "Note in the example above, we specify \"data_cols\" as \"Weekly_Sales\". This indicates that the only column we are modeling is Weekly_Sales. If you wanted to do multivariate prediction, you could also add \"Temperature\", \"Fuel_Price\", \"CPI\", etc. We treat the first of the data columns as the target univariate whose value you wish to forecast." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 5 }