Hierarchical time series notebook¶
This notebook contains examples of modelling hierarchical time series.
Table of contents

* Hierarchical time series
* Preparing dataset
  * Manually setting hierarchical structure
    * Convert dataset to ETNA wide format
    * Create HierarchicalStructure
    * Create hierarchical dataset
  * Hierarchical structure detection
    * Prepare data in ETNA hierarchical long format
    * Convert data to ETNA wide format with to_hierarchical_dataset
    * Create the hierarchical dataset
* Reconciliation methods
  * Bottom-up approach
  * Top-down approach
* Exogenous variables for hierarchical forecasts
[1]:
import warnings
warnings.filterwarnings("ignore")
1. Hierarchical time series¶
In many applications time series have a natural level structure. Time series with such properties can be disaggregated by attributes from lower levels. On the other hand, these time series can be aggregated to higher levels to represent more general relations. The set of possible levels forms the hierarchy of time series.
Two level hierarchical structure
The image above represents relations between the members of the hierarchy. Middle and top levels can be disaggregated using members from lower levels. For example, \(y_{A,t} = y_{AA,t} + y_{AB,t}\) and \(y_t = y_{A,t} + y_{B,t}\).
In matrix notation, level aggregation can be written as
\begin{equation*}
\begin{bmatrix} y_{A,t} \\ y_t \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} y_{AA,t} \\ y_{AB,t} \\ y_{B,t} \end{bmatrix} = S \begin{bmatrix} y_{AA,t} \\ y_{AB,t} \\ y_{B,t} \end{bmatrix}
\end{equation*}

where \(S\) is the summing matrix.
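As a quick sanity check, this aggregation can be reproduced with a few lines of NumPy (a minimal sketch with made-up values for \(y_{AA,t}\), \(y_{AB,t}\) and \(y_{B,t}\)):

import numpy as np

# summing matrix S from the equation above: rows produce [y_A, y_total]
S = np.array([[1, 1, 0], [1, 1, 1]])
bottom = np.array([3.0, 2.0, 5.0])  # [y_AA, y_AB, y_B]
S @ bottom  # array([ 5., 10.]): y_A = 5, y_total = 10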
2. Preparing dataset¶
Consider the Australian tourism dataset.
This dataset consists of the following components:
* Total - total domestic tourism demand
* Tourism reason components (Hol for holiday, Bus for business, etc.)
* Components representing the “region - reason” division (NSW - hol, NSW - bus, etc.)
* Components representing the “region - reason - city” division (NSW - hol - city, NSW - hol - noncity, etc.)
We can see that these components form a hierarchy with the following levels:

1. Total
2. Tourism reason
3. Region
4. City
[2]:
import pandas as pd
pd.options.display.max_columns = 100
[3]:
!curl "https://robjhyndman.com/data/hier1_with_names.csv" --ssl-no-revoke -o "hier1_with_names.csv"
[4]:
df = pd.read_csv("hier1_with_names.csv")
periods = len(df)
df["timestamp"] = pd.date_range("2006-01-01", periods=periods, freq="MS")
df.set_index("timestamp", inplace=True)
df.head()
[4]:
Total | Hol | VFR | Bus | Oth | NSW - hol | VIC - hol | QLD - hol | SA - hol | WA - hol | TAS - hol | NT - hol | NSW - vfr | VIC - vfr | QLD - vfr | SA - vfr | WA - vfr | TAS - vfr | NT - vfr | NSW - bus | VIC - bus | QLD - bus | SA - bus | WA - bus | TAS - bus | NT - bus | NSW - oth | VIC - oth | QLD - oth | SA - oth | WA - oth | TAS - oth | NT - oth | NSW - hol - city | NSW - hol - noncity | VIC - hol - city | VIC - hol - noncity | QLD - hol - city | QLD - hol - noncity | SA - hol - city | SA - hol - noncity | WA - hol - city | WA - hol - noncity | TAS - hol - city | TAS - hol - noncity | NT - hol - city | NT - hol - noncity | NSW - vfr - city | NSW - vfr - noncity | VIC - vfr - city | VIC - vfr - noncity | QLD - vfr - city | QLD - vfr - noncity | SA - vfr - city | SA - vfr - noncity | WA - vfr - city | WA - vfr - noncity | TAS - vfr - city | TAS - vfr - noncity | NT - vfr - city | NT - vfr - noncity | NSW - bus - city | NSW - bus - noncity | VIC - bus - city | VIC - bus - noncity | QLD - bus - city | QLD - bus - noncity | SA - bus - city | SA - bus - noncity | WA - bus - city | WA - bus - noncity | TAS - bus - city | TAS - bus - noncity | NT - bus - city | NT - bus - noncity | NSW - oth - city | NSW - oth - noncity | VIC - oth - city | VIC - oth - noncity | QLD - oth - city | QLD - oth - noncity | SA - oth - city | SA - oth - noncity | WA - oth - city | WA - oth - noncity | TAS - oth - city | TAS - oth - noncity | NT - oth - city | NT - oth - noncity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2006-01-01 | 84503 | 45906 | 26042 | 9815 | 2740 | 17589 | 10412 | 9078 | 3089 | 3449 | 2102 | 187 | 9398 | 5993 | 5290 | 2193 | 1781 | 1350 | 37 | 2885 | 2148 | 2093 | 844 | 1406 | 223 | 216 | 906 | 467 | 702 | 317 | 205 | 100 | 43 | 3096 | 14493 | 2531 | 7881 | 4688 | 4390 | 888 | 2201 | 1383 | 2066 | 619 | 1483 | 101 | 86 | 2709 | 6689 | 2565 | 3428 | 3003 | 2287 | 1324 | 869 | 1019 | 762 | 602 | 748 | 28 | 9 | 1201 | 1684 | 1164 | 984 | 1111 | 982 | 388 | 456 | 532 | 874 | 116 | 107 | 136 | 80 | 396 | 510 | 181 | 286 | 431 | 271 | 244 | 73 | 168 | 37 | 76 | 24 | 35 | 8 |
2006-02-01 | 65312 | 29347 | 20676 | 11823 | 3466 | 11027 | 6025 | 6310 | 1935 | 2454 | 1098 | 498 | 7829 | 4107 | 4902 | 1445 | 1353 | 523 | 517 | 4301 | 1825 | 2224 | 749 | 2043 | 373 | 308 | 1238 | 552 | 839 | 363 | 269 | 97 | 108 | 1479 | 9548 | 1439 | 4586 | 2320 | 3990 | 521 | 1414 | 1059 | 1395 | 409 | 689 | 201 | 297 | 2184 | 5645 | 1852 | 2255 | 1957 | 2945 | 806 | 639 | 750 | 603 | 257 | 266 | 168 | 349 | 2020 | 2281 | 1014 | 811 | 776 | 1448 | 346 | 403 | 356 | 1687 | 83 | 290 | 138 | 170 | 657 | 581 | 229 | 323 | 669 | 170 | 142 | 221 | 170 | 99 | 36 | 61 | 69 | 39 |
2006-03-01 | 72753 | 32492 | 20582 | 13565 | 6114 | 8910 | 5060 | 11733 | 1569 | 3398 | 458 | 1364 | 7277 | 3811 | 5489 | 1453 | 1687 | 391 | 474 | 4093 | 1944 | 3379 | 750 | 1560 | 303 | 1536 | 1433 | 446 | 1434 | 712 | 1546 | 55 | 488 | 1609 | 7301 | 1488 | 3572 | 4758 | 6975 | 476 | 1093 | 1101 | 2297 | 127 | 331 | 619 | 745 | 2225 | 5052 | 1882 | 1929 | 2619 | 2870 | 1078 | 375 | 953 | 734 | 130 | 261 | 390 | 84 | 1975 | 2118 | 1153 | 791 | 1079 | 2300 | 390 | 360 | 440 | 1120 | 196 | 107 | 452 | 1084 | 540 | 893 | 128 | 318 | 270 | 1164 | 397 | 315 | 380 | 1166 | 32 | 23 | 150 | 338 |
2006-04-01 | 70880 | 31813 | 21613 | 11478 | 5976 | 10658 | 5481 | 8109 | 2270 | 3561 | 1320 | 414 | 8303 | 5090 | 4441 | 1209 | 1714 | 394 | 462 | 3463 | 1753 | 2880 | 890 | 1791 | 298 | 403 | 1902 | 606 | 749 | 454 | 1549 | 91 | 625 | 1520 | 9138 | 1906 | 3575 | 3328 | 4781 | 571 | 1699 | 1128 | 2433 | 371 | 949 | 164 | 250 | 2918 | 5385 | 2208 | 2882 | 2097 | 2344 | 568 | 641 | 999 | 715 | 137 | 257 | 244 | 218 | 1500 | 1963 | 1245 | 508 | 1128 | 1752 | 255 | 635 | 539 | 1252 | 70 | 228 | 243 | 160 | 745 | 1157 | 270 | 336 | 214 | 535 | 194 | 260 | 410 | 1139 | 48 | 43 | 172 | 453 |
2006-05-01 | 86893 | 46793 | 26947 | 10027 | 3126 | 16152 | 10958 | 10047 | 3023 | 4287 | 2113 | 213 | 10386 | 6152 | 5636 | 1685 | 2026 | 784 | 278 | 3347 | 1522 | 2751 | 666 | 1023 | 335 | 383 | 984 | 558 | 1015 | 180 | 190 | 137 | 62 | 1958 | 14194 | 2517 | 8441 | 4930 | 5117 | 873 | 2150 | 1560 | 2727 | 523 | 1590 | 62 | 151 | 3154 | 7232 | 2988 | 3164 | 2703 | 2933 | 887 | 798 | 1396 | 630 | 347 | 437 | 153 | 125 | 1196 | 2151 | 950 | 572 | 1192 | 1559 | 386 | 280 | 582 | 441 | 130 | 205 | 194 | 189 | 426 | 558 | 265 | 293 | 458 | 557 | 147 | 33 | 162 | 28 | 77 | 60 | 15 | 47 |
2.1 Manually setting hierarchical structure¶
This section shows how to set a hierarchical structure manually and prepare the data. We are going to create a hierarchical dataset with two levels: total demand and demand per tourism reason.
[5]:
from etna.datasets import TSDataset
Consider the Reason level of the hierarchy.
[6]:
reason_segments = ["Hol", "VFR", "Bus", "Oth"]
df[reason_segments].head()
[6]:
Hol | VFR | Bus | Oth | |
---|---|---|---|---|
timestamp | ||||
2006-01-01 | 45906 | 26042 | 9815 | 2740 |
2006-02-01 | 29347 | 20676 | 11823 | 3466 |
2006-03-01 | 32492 | 20582 | 13565 | 6114 |
2006-04-01 | 31813 | 21613 | 11478 | 5976 |
2006-05-01 | 46793 | 26947 | 10027 | 3126 |
2.1.1 Convert dataset to ETNA wide format¶
First, convert the dataframe to ETNA long format.
[7]:
hierarchical_df = []
for segment_name in reason_segments:
segment = df[segment_name]
segment_slice = pd.DataFrame(
{"timestamp": segment.index, "target": segment.values, "segment": [segment_name] * periods}
)
hierarchical_df.append(segment_slice)
hierarchical_df = pd.concat(hierarchical_df, axis=0)
hierarchical_df.head()
[7]:
timestamp | target | segment | |
---|---|---|---|
0 | 2006-01-01 | 45906 | Hol |
1 | 2006-02-01 | 29347 | Hol |
2 | 2006-03-01 | 32492 | Hol |
3 | 2006-04-01 | 31813 | Hol |
4 | 2006-05-01 | 46793 | Hol |
Now the dataframe can be converted to ETNA wide format.
[8]:
hierarchical_df = TSDataset.to_dataset(df=hierarchical_df)
2.1.2 Create HierarchicalStructure¶
For handling information about the hierarchical structure, there is a dedicated object in the ETNA library: HierarchicalStructure.

To create a HierarchicalStructure, define the relationships between segments at different levels. These relations should be described as a mapping between level members, where keys are parent segments and values are lists of child segments from the lower level. Also provide a list of level names, where the ordering corresponds to the hierarchical relationships between levels.
[9]:
from etna.datasets import HierarchicalStructure
[10]:
hierarchical_structure = HierarchicalStructure(
level_structure={"total": ["Hol", "VFR", "Bus", "Oth"]}, level_names=["total", "reason"]
)
hierarchical_structure
[10]:
HierarchicalStructure(level_structure = {'total': ['Hol', 'VFR', 'Bus', 'Oth']}, level_names = ['total', 'reason'], )
2.1.3 Create hierarchical dataset¶
When all the data is prepared, call the TSDataset constructor to create a hierarchical dataset.
[11]:
hierarchical_ts = TSDataset(df=hierarchical_df, freq="MS", hierarchical_structure=hierarchical_structure)
hierarchical_ts.head()
[11]:
segment | Bus | Hol | Oth | VFR |
---|---|---|---|---|
feature | target | target | target | target |
timestamp | ||||
2006-01-01 | 9815 | 45906 | 2740 | 26042 |
2006-02-01 | 11823 | 29347 | 3466 | 20676 |
2006-03-01 | 13565 | 32492 | 6114 | 20582 |
2006-04-01 | 11478 | 31813 | 5976 | 21613 |
2006-05-01 | 10027 | 46793 | 3126 | 26947 |
Ensure that the dataset is at the desired level.
[12]:
hierarchical_ts.current_df_level
[12]:
'reason'
2.2 Hierarchical structure detection¶
This section shows how to prepare data and detect the hierarchical structure. The main advantage of this approach to creating hierarchical structures is that you don’t need to define an adjacency list: all hierarchical relationships will be detected from the dataframe columns.

The main applications for this approach are when defining the adjacency list is not desirable or when some columns of the dataframe already contain information about the hierarchy (e.g. related categorical columns).

A dataframe must be prepared in a specific format for detection to work. The following sections show how to do so.
Consider the City level of the hierarchy.
[13]:
city_segments = list(filter(lambda name: name.count("-") == 2, df.columns))
df[city_segments].head()
[13]:
NSW - hol - city | NSW - hol - noncity | VIC - hol - city | VIC - hol - noncity | QLD - hol - city | QLD - hol - noncity | SA - hol - city | SA - hol - noncity | WA - hol - city | WA - hol - noncity | TAS - hol - city | TAS - hol - noncity | NT - hol - city | NT - hol - noncity | NSW - vfr - city | NSW - vfr - noncity | VIC - vfr - city | VIC - vfr - noncity | QLD - vfr - city | QLD - vfr - noncity | SA - vfr - city | SA - vfr - noncity | WA - vfr - city | WA - vfr - noncity | TAS - vfr - city | TAS - vfr - noncity | NT - vfr - city | NT - vfr - noncity | NSW - bus - city | NSW - bus - noncity | VIC - bus - city | VIC - bus - noncity | QLD - bus - city | QLD - bus - noncity | SA - bus - city | SA - bus - noncity | WA - bus - city | WA - bus - noncity | TAS - bus - city | TAS - bus - noncity | NT - bus - city | NT - bus - noncity | NSW - oth - city | NSW - oth - noncity | VIC - oth - city | VIC - oth - noncity | QLD - oth - city | QLD - oth - noncity | SA - oth - city | SA - oth - noncity | WA - oth - city | WA - oth - noncity | TAS - oth - city | TAS - oth - noncity | NT - oth - city | NT - oth - noncity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2006-01-01 | 3096 | 14493 | 2531 | 7881 | 4688 | 4390 | 888 | 2201 | 1383 | 2066 | 619 | 1483 | 101 | 86 | 2709 | 6689 | 2565 | 3428 | 3003 | 2287 | 1324 | 869 | 1019 | 762 | 602 | 748 | 28 | 9 | 1201 | 1684 | 1164 | 984 | 1111 | 982 | 388 | 456 | 532 | 874 | 116 | 107 | 136 | 80 | 396 | 510 | 181 | 286 | 431 | 271 | 244 | 73 | 168 | 37 | 76 | 24 | 35 | 8 |
2006-02-01 | 1479 | 9548 | 1439 | 4586 | 2320 | 3990 | 521 | 1414 | 1059 | 1395 | 409 | 689 | 201 | 297 | 2184 | 5645 | 1852 | 2255 | 1957 | 2945 | 806 | 639 | 750 | 603 | 257 | 266 | 168 | 349 | 2020 | 2281 | 1014 | 811 | 776 | 1448 | 346 | 403 | 356 | 1687 | 83 | 290 | 138 | 170 | 657 | 581 | 229 | 323 | 669 | 170 | 142 | 221 | 170 | 99 | 36 | 61 | 69 | 39 |
2006-03-01 | 1609 | 7301 | 1488 | 3572 | 4758 | 6975 | 476 | 1093 | 1101 | 2297 | 127 | 331 | 619 | 745 | 2225 | 5052 | 1882 | 1929 | 2619 | 2870 | 1078 | 375 | 953 | 734 | 130 | 261 | 390 | 84 | 1975 | 2118 | 1153 | 791 | 1079 | 2300 | 390 | 360 | 440 | 1120 | 196 | 107 | 452 | 1084 | 540 | 893 | 128 | 318 | 270 | 1164 | 397 | 315 | 380 | 1166 | 32 | 23 | 150 | 338 |
2006-04-01 | 1520 | 9138 | 1906 | 3575 | 3328 | 4781 | 571 | 1699 | 1128 | 2433 | 371 | 949 | 164 | 250 | 2918 | 5385 | 2208 | 2882 | 2097 | 2344 | 568 | 641 | 999 | 715 | 137 | 257 | 244 | 218 | 1500 | 1963 | 1245 | 508 | 1128 | 1752 | 255 | 635 | 539 | 1252 | 70 | 228 | 243 | 160 | 745 | 1157 | 270 | 336 | 214 | 535 | 194 | 260 | 410 | 1139 | 48 | 43 | 172 | 453 |
2006-05-01 | 1958 | 14194 | 2517 | 8441 | 4930 | 5117 | 873 | 2150 | 1560 | 2727 | 523 | 1590 | 62 | 151 | 3154 | 7232 | 2988 | 3164 | 2703 | 2933 | 887 | 798 | 1396 | 630 | 347 | 437 | 153 | 125 | 1196 | 2151 | 950 | 572 | 1192 | 1559 | 386 | 280 | 582 | 441 | 130 | 205 | 194 | 189 | 426 | 558 | 265 | 293 | 458 | 557 | 147 | 33 | 162 | 28 | 77 | 60 | 15 | 47 |
2.2.1 Prepare data in ETNA hierarchical long format¶
Before trying to detect a hierarchical structure, the data must be transformed to hierarchical long format. In this format, your DataFrame must contain timestamp, target, and level columns. Each level column represents the membership of the observation at a higher level of the hierarchy.
[14]:
hierarchical_df = []
for segment_name in city_segments:
segment = df[segment_name]
region, reason, city = segment_name.split(" - ")
seg_df = pd.DataFrame(
data={
"timestamp": segment.index,
"target": segment.values,
"city_level": [city] * periods,
"region_level": [region] * periods,
"reason_level": [reason] * periods,
},
)
hierarchical_df.append(seg_df)
hierarchical_df = pd.concat(hierarchical_df, axis=0)
hierarchical_df["reason_level"].replace({"hol": "Hol", "vfr": "VFR", "bus": "Bus", "oth": "Oth"}, inplace=True)
hierarchical_df.head()
[14]:
timestamp | target | city_level | region_level | reason_level | |
---|---|---|---|---|---|
0 | 2006-01-01 | 3096 | city | NSW | Hol |
1 | 2006-02-01 | 1479 | city | NSW | Hol |
2 | 2006-03-01 | 1609 | city | NSW | Hol |
3 | 2006-04-01 | 1520 | city | NSW | Hol |
4 | 2006-05-01 | 1958 | city | NSW | Hol |
Here we can omit the total level, as it will be added automatically during hierarchy detection.
2.2.2 Convert data to ETNA wide format with to_hierarchical_dataset¶
To detect the hierarchical structure and convert the DataFrame to ETNA wide format, call TSDataset.to_hierarchical_dataset, passing the prepared data and the level column names in order from top to bottom.
[15]:
hierarchical_df, hierarchical_structure = TSDataset.to_hierarchical_dataset(
df=hierarchical_df, level_columns=["reason_level", "region_level", "city_level"]
)
hierarchical_df.head()
[15]:
segment | Bus_NSW_city | Bus_NSW_noncity | Bus_NT_city | Bus_NT_noncity | Bus_QLD_city | Bus_QLD_noncity | Bus_SA_city | Bus_SA_noncity | Bus_TAS_city | Bus_TAS_noncity | Bus_VIC_city | Bus_VIC_noncity | Bus_WA_city | Bus_WA_noncity | Hol_NSW_city | Hol_NSW_noncity | Hol_NT_city | Hol_NT_noncity | Hol_QLD_city | Hol_QLD_noncity | Hol_SA_city | Hol_SA_noncity | Hol_TAS_city | Hol_TAS_noncity | Hol_VIC_city | Hol_VIC_noncity | Hol_WA_city | Hol_WA_noncity | Oth_NSW_city | Oth_NSW_noncity | Oth_NT_city | Oth_NT_noncity | Oth_QLD_city | Oth_QLD_noncity | Oth_SA_city | Oth_SA_noncity | Oth_TAS_city | Oth_TAS_noncity | Oth_VIC_city | Oth_VIC_noncity | Oth_WA_city | Oth_WA_noncity | VFR_NSW_city | VFR_NSW_noncity | VFR_NT_city | VFR_NT_noncity | VFR_QLD_city | VFR_QLD_noncity | VFR_SA_city | VFR_SA_noncity | VFR_TAS_city | VFR_TAS_noncity | VFR_VIC_city | VFR_VIC_noncity | VFR_WA_city | VFR_WA_noncity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target |
timestamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2006-01-01 | 1201 | 1684 | 136 | 80 | 1111 | 982 | 388 | 456 | 116 | 107 | 1164 | 984 | 532 | 874 | 3096 | 14493 | 101 | 86 | 4688 | 4390 | 888 | 2201 | 619 | 1483 | 2531 | 7881 | 1383 | 2066 | 396 | 510 | 35 | 8 | 431 | 271 | 244 | 73 | 76 | 24 | 181 | 286 | 168 | 37 | 2709 | 6689 | 28 | 9 | 3003 | 2287 | 1324 | 869 | 602 | 748 | 2565 | 3428 | 1019 | 762 |
2006-02-01 | 2020 | 2281 | 138 | 170 | 776 | 1448 | 346 | 403 | 83 | 290 | 1014 | 811 | 356 | 1687 | 1479 | 9548 | 201 | 297 | 2320 | 3990 | 521 | 1414 | 409 | 689 | 1439 | 4586 | 1059 | 1395 | 657 | 581 | 69 | 39 | 669 | 170 | 142 | 221 | 36 | 61 | 229 | 323 | 170 | 99 | 2184 | 5645 | 168 | 349 | 1957 | 2945 | 806 | 639 | 257 | 266 | 1852 | 2255 | 750 | 603 |
2006-03-01 | 1975 | 2118 | 452 | 1084 | 1079 | 2300 | 390 | 360 | 196 | 107 | 1153 | 791 | 440 | 1120 | 1609 | 7301 | 619 | 745 | 4758 | 6975 | 476 | 1093 | 127 | 331 | 1488 | 3572 | 1101 | 2297 | 540 | 893 | 150 | 338 | 270 | 1164 | 397 | 315 | 32 | 23 | 128 | 318 | 380 | 1166 | 2225 | 5052 | 390 | 84 | 2619 | 2870 | 1078 | 375 | 130 | 261 | 1882 | 1929 | 953 | 734 |
2006-04-01 | 1500 | 1963 | 243 | 160 | 1128 | 1752 | 255 | 635 | 70 | 228 | 1245 | 508 | 539 | 1252 | 1520 | 9138 | 164 | 250 | 3328 | 4781 | 571 | 1699 | 371 | 949 | 1906 | 3575 | 1128 | 2433 | 745 | 1157 | 172 | 453 | 214 | 535 | 194 | 260 | 48 | 43 | 270 | 336 | 410 | 1139 | 2918 | 5385 | 244 | 218 | 2097 | 2344 | 568 | 641 | 137 | 257 | 2208 | 2882 | 999 | 715 |
2006-05-01 | 1196 | 2151 | 194 | 189 | 1192 | 1559 | 386 | 280 | 130 | 205 | 950 | 572 | 582 | 441 | 1958 | 14194 | 62 | 151 | 4930 | 5117 | 873 | 2150 | 523 | 1590 | 2517 | 8441 | 1560 | 2727 | 426 | 558 | 15 | 47 | 458 | 557 | 147 | 33 | 77 | 60 | 265 | 293 | 162 | 28 | 3154 | 7232 | 153 | 125 | 2703 | 2933 | 887 | 798 | 347 | 437 | 2988 | 3164 | 1396 | 630 |
[16]:
hierarchical_structure
[16]:
HierarchicalStructure(level_structure = {'total': ['Hol', 'VFR', 'Bus', 'Oth'], 'Bus': ['Bus_NSW', 'Bus_VIC', 'Bus_QLD', 'Bus_SA', 'Bus_WA', 'Bus_TAS', 'Bus_NT'], 'Hol': ['Hol_NSW', 'Hol_VIC', 'Hol_QLD', 'Hol_SA', 'Hol_WA', 'Hol_TAS', 'Hol_NT'], 'Oth': ['Oth_NSW', 'Oth_VIC', 'Oth_QLD', 'Oth_SA', 'Oth_WA', 'Oth_TAS', 'Oth_NT'], 'VFR': ['VFR_NSW', 'VFR_VIC', 'VFR_QLD', 'VFR_SA', 'VFR_WA', 'VFR_TAS', 'VFR_NT'], 'Bus_NSW': ['Bus_NSW_city', 'Bus_NSW_noncity'], 'Bus_NT': ['Bus_NT_city', 'Bus_NT_noncity'], 'Bus_QLD': ['Bus_QLD_city', 'Bus_QLD_noncity'], 'Bus_SA': ['Bus_SA_city', 'Bus_SA_noncity'], 'Bus_TAS': ['Bus_TAS_city', 'Bus_TAS_noncity'], 'Bus_VIC': ['Bus_VIC_city', 'Bus_VIC_noncity'], 'Bus_WA': ['Bus_WA_city', 'Bus_WA_noncity'], 'Hol_NSW': ['Hol_NSW_city', 'Hol_NSW_noncity'], 'Hol_NT': ['Hol_NT_city', 'Hol_NT_noncity'], 'Hol_QLD': ['Hol_QLD_city', 'Hol_QLD_noncity'], 'Hol_SA': ['Hol_SA_city', 'Hol_SA_noncity'], 'Hol_TAS': ['Hol_TAS_city', 'Hol_TAS_noncity'], 'Hol_VIC': ['Hol_VIC_city', 'Hol_VIC_noncity'], 'Hol_WA': ['Hol_WA_city', 'Hol_WA_noncity'], 'Oth_NSW': ['Oth_NSW_city', 'Oth_NSW_noncity'], 'Oth_NT': ['Oth_NT_city', 'Oth_NT_noncity'], 'Oth_QLD': ['Oth_QLD_city', 'Oth_QLD_noncity'], 'Oth_SA': ['Oth_SA_city', 'Oth_SA_noncity'], 'Oth_TAS': ['Oth_TAS_city', 'Oth_TAS_noncity'], 'Oth_VIC': ['Oth_VIC_city', 'Oth_VIC_noncity'], 'Oth_WA': ['Oth_WA_city', 'Oth_WA_noncity'], 'VFR_NSW': ['VFR_NSW_city', 'VFR_NSW_noncity'], 'VFR_NT': ['VFR_NT_city', 'VFR_NT_noncity'], 'VFR_QLD': ['VFR_QLD_city', 'VFR_QLD_noncity'], 'VFR_SA': ['VFR_SA_city', 'VFR_SA_noncity'], 'VFR_TAS': ['VFR_TAS_city', 'VFR_TAS_noncity'], 'VFR_VIC': ['VFR_VIC_city', 'VFR_VIC_noncity'], 'VFR_WA': ['VFR_WA_city', 'VFR_WA_noncity']}, level_names = ['total', 'reason_level', 'region_level', 'city_level'], )
Here we see that hierarchical_structure has a mapping between higher-level segments and adjacent lower-level segments.
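The detected structure can also be queried directly, e.g. to list the segments of a particular level with the get_level_segments method (also used later in this notebook):

hierarchical_structure.get_level_segments(level_name="region_level")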
2.2.3 Create the hierarchical dataset¶
To convert the data to a TSDataset, call the constructor and pass the detected hierarchical_structure.
[17]:
hierarchical_ts = TSDataset(df=hierarchical_df, freq="MS", hierarchical_structure=hierarchical_structure)
hierarchical_ts.head()
[17]:
segment | Bus_NSW_city | Bus_NSW_noncity | Bus_NT_city | Bus_NT_noncity | Bus_QLD_city | Bus_QLD_noncity | Bus_SA_city | Bus_SA_noncity | Bus_TAS_city | Bus_TAS_noncity | Bus_VIC_city | Bus_VIC_noncity | Bus_WA_city | Bus_WA_noncity | Hol_NSW_city | Hol_NSW_noncity | Hol_NT_city | Hol_NT_noncity | Hol_QLD_city | Hol_QLD_noncity | Hol_SA_city | Hol_SA_noncity | Hol_TAS_city | Hol_TAS_noncity | Hol_VIC_city | Hol_VIC_noncity | Hol_WA_city | Hol_WA_noncity | Oth_NSW_city | Oth_NSW_noncity | Oth_NT_city | Oth_NT_noncity | Oth_QLD_city | Oth_QLD_noncity | Oth_SA_city | Oth_SA_noncity | Oth_TAS_city | Oth_TAS_noncity | Oth_VIC_city | Oth_VIC_noncity | Oth_WA_city | Oth_WA_noncity | VFR_NSW_city | VFR_NSW_noncity | VFR_NT_city | VFR_NT_noncity | VFR_QLD_city | VFR_QLD_noncity | VFR_SA_city | VFR_SA_noncity | VFR_TAS_city | VFR_TAS_noncity | VFR_VIC_city | VFR_VIC_noncity | VFR_WA_city | VFR_WA_noncity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target |
timestamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2006-01-01 | 1201 | 1684 | 136 | 80 | 1111 | 982 | 388 | 456 | 116 | 107 | 1164 | 984 | 532 | 874 | 3096 | 14493 | 101 | 86 | 4688 | 4390 | 888 | 2201 | 619 | 1483 | 2531 | 7881 | 1383 | 2066 | 396 | 510 | 35 | 8 | 431 | 271 | 244 | 73 | 76 | 24 | 181 | 286 | 168 | 37 | 2709 | 6689 | 28 | 9 | 3003 | 2287 | 1324 | 869 | 602 | 748 | 2565 | 3428 | 1019 | 762 |
2006-02-01 | 2020 | 2281 | 138 | 170 | 776 | 1448 | 346 | 403 | 83 | 290 | 1014 | 811 | 356 | 1687 | 1479 | 9548 | 201 | 297 | 2320 | 3990 | 521 | 1414 | 409 | 689 | 1439 | 4586 | 1059 | 1395 | 657 | 581 | 69 | 39 | 669 | 170 | 142 | 221 | 36 | 61 | 229 | 323 | 170 | 99 | 2184 | 5645 | 168 | 349 | 1957 | 2945 | 806 | 639 | 257 | 266 | 1852 | 2255 | 750 | 603 |
2006-03-01 | 1975 | 2118 | 452 | 1084 | 1079 | 2300 | 390 | 360 | 196 | 107 | 1153 | 791 | 440 | 1120 | 1609 | 7301 | 619 | 745 | 4758 | 6975 | 476 | 1093 | 127 | 331 | 1488 | 3572 | 1101 | 2297 | 540 | 893 | 150 | 338 | 270 | 1164 | 397 | 315 | 32 | 23 | 128 | 318 | 380 | 1166 | 2225 | 5052 | 390 | 84 | 2619 | 2870 | 1078 | 375 | 130 | 261 | 1882 | 1929 | 953 | 734 |
2006-04-01 | 1500 | 1963 | 243 | 160 | 1128 | 1752 | 255 | 635 | 70 | 228 | 1245 | 508 | 539 | 1252 | 1520 | 9138 | 164 | 250 | 3328 | 4781 | 571 | 1699 | 371 | 949 | 1906 | 3575 | 1128 | 2433 | 745 | 1157 | 172 | 453 | 214 | 535 | 194 | 260 | 48 | 43 | 270 | 336 | 410 | 1139 | 2918 | 5385 | 244 | 218 | 2097 | 2344 | 568 | 641 | 137 | 257 | 2208 | 2882 | 999 | 715 |
2006-05-01 | 1196 | 2151 | 194 | 189 | 1192 | 1559 | 386 | 280 | 130 | 205 | 950 | 572 | 582 | 441 | 1958 | 14194 | 62 | 151 | 4930 | 5117 | 873 | 2150 | 523 | 1590 | 2517 | 8441 | 1560 | 2727 | 426 | 558 | 15 | 47 | 458 | 557 | 147 | 33 | 77 | 60 | 265 | 293 | 162 | 28 | 3154 | 7232 | 153 | 125 | 2703 | 2933 | 887 | 798 | 347 | 437 | 2988 | 3164 | 1396 | 630 |
Now the dataset has been converted to a hierarchical one. We can examine which hierarchical levels were detected with the following code.
[18]:
hierarchical_ts.hierarchical_structure.level_names
[18]:
['total', 'reason_level', 'region_level', 'city_level']
Ensure that the dataset is at the desired level.
[19]:
hierarchical_ts.current_df_level
[19]:
'city_level'
The hierarchical dataset can be aggregated to higher levels with the get_level_dataset method.
[20]:
hierarchical_ts.get_level_dataset(target_level="reason_level").head()
[20]:
segment | Hol | VFR | Bus | Oth |
---|---|---|---|---|
feature | target | target | target | target |
timestamp | ||||
2006-01-01 | 45906 | 26042 | 9815 | 2740 |
2006-02-01 | 29347 | 20676 | 11823 | 3466 |
2006-03-01 | 32492 | 20582 | 13565 | 6114 |
2006-04-01 | 31813 | 21613 | 11478 | 5976 |
2006-05-01 | 46793 | 26947 | 10027 | 3126 |
3. Reconciliation methods¶
In this section we examine the reconciliation methods available in ETNA. Hierarchical time series reconciliation readjusts the predictions produced by separate models on a set of hierarchically linked time series, making the forecasts more accurate and ensuring that they are coherent.
There are several reconciliation methods in the ETNA library:

* Bottom-up approach
* Top-down approach
The middle-out reconciliation approach can be viewed as a composition of the bottom-up and top-down approaches, so it can be implemented using functionality from the library: for aggregation to higher levels, use the provided bottom-up methods, and for disaggregation to lower levels, the top-down methods.
Reconciliation methods estimate mapping matrices to perform transitions between levels. These matrices are sparse, so ETNA uses scipy.sparse.csr_matrix to store them and perform computations efficiently.
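For intuition, here is a toy sketch of such a sparse matrix (illustrative only, not the library internals):

import numpy as np
from scipy.sparse import csr_matrix

# summing matrix for a two-parent, five-child hierarchy
mapping = csr_matrix(np.array([[1, 1, 0, 0, 0], [0, 0, 1, 1, 1]]))
mapping.nnz  # 5 stored non-zero values instead of 10 dense entries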
More detailed information about this and other reconciliation methods can be found here.
[21]:
from etna.transforms import LagTransform
from etna.transforms import OneHotEncoderTransform
from etna.models import LinearPerSegmentModel
from etna.metrics import SMAPE
from etna.pipeline import HierarchicalPipeline
from etna.pipeline import Pipeline
Some important definitions:

* source level - the level of the hierarchy used for model estimation; reconciliation is applied to this level
* target level - the desired level of the hierarchy; after reconciliation we want to have series at this level
3.1. Bottom-up approach¶
The main idea of this approach is to produce forecasts for time series at lower hierarchical levels and then aggregate them to higher levels:

\begin{align*} \hat y_{A,h} = \hat y_{AA,h} + \hat y_{AB,h}, && \hat y_{B,h} = \hat y_{BA,h} + \hat y_{BB,h} + \hat y_{BC,h} \end{align*}

where \(h\) is the forecast horizon.
In matrix notation:
\begin{equation*}
\begin{bmatrix} \hat y_{A,h} \\ \hat y_{B,h} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \hat y_{AA,h} \\ \hat y_{AB,h} \\ \hat y_{BA,h} \\ \hat y_{BB,h} \\ \hat y_{BC,h} \end{bmatrix}
\end{equation*}
An advantage of this approach is that we forecast at the bottom level of the structure and are able to capture all the patterns of the individual series. On the other hand, bottom-level data can be quite noisy and more challenging to model and forecast.
[22]:
from etna.reconciliation import BottomUpReconciliator
To create a BottomUpReconciliator, specify the “source” and “target” levels for aggregation. Make sure that the source level is lower in the hierarchy than the target level.
[23]:
reconciliator = BottomUpReconciliator(target_level="region_level", source_level="city_level")
The next step is mapping matrix estimation. To do so, pass the hierarchical dataset to the fit method. The current dataset level should be lower than or equal to the source level.
[24]:
reconciliator.fit(ts=hierarchical_ts)
reconciliator.mapping_matrix.toarray()
[24]:
array([[1, 1, 0, ..., 0, 0, 0],
[0, 0, 1, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 1, 0, 0],
[0, 0, 0, ..., 0, 1, 1]])
Now the reconciliator is ready to perform aggregation to the target level.
[25]:
reconciliator.aggregate(ts=hierarchical_ts).head(5)
[25]:
segment | Bus_NSW_city | Bus_NSW_noncity | Bus_NT_city | Bus_NT_noncity | Bus_QLD_city | Bus_QLD_noncity | Bus_SA_city | Bus_SA_noncity | Bus_TAS_city | Bus_TAS_noncity | Bus_VIC_city | Bus_VIC_noncity | Bus_WA_city | Bus_WA_noncity | Hol_NSW_city | Hol_NSW_noncity | Hol_NT_city | Hol_NT_noncity | Hol_QLD_city | Hol_QLD_noncity | Hol_SA_city | Hol_SA_noncity | Hol_TAS_city | Hol_TAS_noncity | Hol_VIC_city | Hol_VIC_noncity | Hol_WA_city | Hol_WA_noncity | Oth_NSW_city | Oth_NSW_noncity | Oth_NT_city | Oth_NT_noncity | Oth_QLD_city | Oth_QLD_noncity | Oth_SA_city | Oth_SA_noncity | Oth_TAS_city | Oth_TAS_noncity | Oth_VIC_city | Oth_VIC_noncity | Oth_WA_city | Oth_WA_noncity | VFR_NSW_city | VFR_NSW_noncity | VFR_NT_city | VFR_NT_noncity | VFR_QLD_city | VFR_QLD_noncity | VFR_SA_city | VFR_SA_noncity | VFR_TAS_city | VFR_TAS_noncity | VFR_VIC_city | VFR_VIC_noncity | VFR_WA_city | VFR_WA_noncity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target |
timestamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2006-01-01 | 1201 | 1684 | 136 | 80 | 1111 | 982 | 388 | 456 | 116 | 107 | 1164 | 984 | 532 | 874 | 3096 | 14493 | 101 | 86 | 4688 | 4390 | 888 | 2201 | 619 | 1483 | 2531 | 7881 | 1383 | 2066 | 396 | 510 | 35 | 8 | 431 | 271 | 244 | 73 | 76 | 24 | 181 | 286 | 168 | 37 | 2709 | 6689 | 28 | 9 | 3003 | 2287 | 1324 | 869 | 602 | 748 | 2565 | 3428 | 1019 | 762 |
2006-02-01 | 2020 | 2281 | 138 | 170 | 776 | 1448 | 346 | 403 | 83 | 290 | 1014 | 811 | 356 | 1687 | 1479 | 9548 | 201 | 297 | 2320 | 3990 | 521 | 1414 | 409 | 689 | 1439 | 4586 | 1059 | 1395 | 657 | 581 | 69 | 39 | 669 | 170 | 142 | 221 | 36 | 61 | 229 | 323 | 170 | 99 | 2184 | 5645 | 168 | 349 | 1957 | 2945 | 806 | 639 | 257 | 266 | 1852 | 2255 | 750 | 603 |
2006-03-01 | 1975 | 2118 | 452 | 1084 | 1079 | 2300 | 390 | 360 | 196 | 107 | 1153 | 791 | 440 | 1120 | 1609 | 7301 | 619 | 745 | 4758 | 6975 | 476 | 1093 | 127 | 331 | 1488 | 3572 | 1101 | 2297 | 540 | 893 | 150 | 338 | 270 | 1164 | 397 | 315 | 32 | 23 | 128 | 318 | 380 | 1166 | 2225 | 5052 | 390 | 84 | 2619 | 2870 | 1078 | 375 | 130 | 261 | 1882 | 1929 | 953 | 734 |
2006-04-01 | 1500 | 1963 | 243 | 160 | 1128 | 1752 | 255 | 635 | 70 | 228 | 1245 | 508 | 539 | 1252 | 1520 | 9138 | 164 | 250 | 3328 | 4781 | 571 | 1699 | 371 | 949 | 1906 | 3575 | 1128 | 2433 | 745 | 1157 | 172 | 453 | 214 | 535 | 194 | 260 | 48 | 43 | 270 | 336 | 410 | 1139 | 2918 | 5385 | 244 | 218 | 2097 | 2344 | 568 | 641 | 137 | 257 | 2208 | 2882 | 999 | 715 |
2006-05-01 | 1196 | 2151 | 194 | 189 | 1192 | 1559 | 386 | 280 | 130 | 205 | 950 | 572 | 582 | 441 | 1958 | 14194 | 62 | 151 | 4930 | 5117 | 873 | 2150 | 523 | 1590 | 2517 | 8441 | 1560 | 2727 | 426 | 558 | 15 | 47 | 458 | 557 | 147 | 33 | 77 | 60 | 265 | 293 | 162 | 28 | 3154 | 7232 | 153 | 125 | 2703 | 2933 | 887 | 798 | 347 | 437 | 2988 | 3164 | 1396 | 630 |
HierarchicalPipeline provides the ability to perform forecast reconciliation within a pipeline. A couple of points to keep in mind while working with this type of pipeline:

1. Transforms and the model work with the dataset at the source level.
2. Forecasts are automatically reconciled to the target level, and metrics are reported for the target level as well.
[26]:
pipeline = HierarchicalPipeline(
transforms=[
LagTransform(in_column="target", lags=[1, 2, 3, 4, 6, 12]),
],
model=LinearPerSegmentModel(),
reconciliator=BottomUpReconciliator(target_level="region_level", source_level="city_level"),
)
bottom_up_metrics, _, _ = pipeline.backtest(ts=hierarchical_ts, metrics=[SMAPE()], n_folds=3, aggregate_metrics=True)
bottom_up_metrics = bottom_up_metrics.set_index("segment").add_suffix("_bottom_up")
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.4s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.8s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 1.3s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 1.3s finished
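Besides backtesting, the same pipeline can produce reconciled forecasts directly. A minimal sketch, assuming the dataset defined above (the horizon value here is arbitrary):

pipeline = HierarchicalPipeline(
    transforms=[
        LagTransform(in_column="target", lags=[6, 12]),
    ],
    model=LinearPerSegmentModel(),
    reconciliator=BottomUpReconciliator(target_level="region_level", source_level="city_level"),
    horizon=3,
)
pipeline.fit(ts=hierarchical_ts)  # the model is fitted at the city level
forecast_ts = pipeline.forecast()  # forecasts are reconciled to the region level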
3.2. Top-down approach¶
The top-down approach is based on the idea of generating forecasts for time series at higher hierarchy levels and then disaggregating them to lower levels. This approach can be expressed with the following formula:
\begin{align*} \hat y_{AA,h} = p_{AA} \hat y_{A,h}, && \hat y_{AB,h} = p_{AB} \hat y_{A,h}, && \hat y_{BA,h} = p_{BA} \hat y_{B,h}, && \hat y_{BB,h} = p_{BB} \hat y_{B,h}, && \hat y_{BC,h} = p_{BC} \hat y_{B,h} \end{align*}
In matrix notation, this equation can be rewritten as:
\begin{equation*}
\begin{bmatrix} \hat y_{AA,h} \\ \hat y_{AB,h} \\ \hat y_{BA,h} \\ \hat y_{BB,h} \\ \hat y_{BC,h} \end{bmatrix} = \begin{bmatrix} p_{AA} & 0 & 0 & 0 & 0 \\ 0 & p_{AB} & 0 & 0 & 0 \\ 0 & 0 & p_{BA} & 0 & 0 \\ 0 & 0 & 0 & p_{BB} & 0 \\ 0 & 0 & 0 & 0 & p_{BC} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \hat y_{A,h} \\ \hat y_{B,h} \end{bmatrix} = P S^T \begin{bmatrix} \hat y_{A,h} \\ \hat y_{B,h} \end{bmatrix}
\end{equation*}
The main challenge of this approach is proportion estimation. In the ETNA library, two methods are available:

* Average historical proportions (AHP)
* Proportions of the historical averages (PHA)
Average historical proportions

This method is based on averaging historical proportions:

\begin{equation} \large p_i = \frac{1}{n} \sum_{t = T - n}^{T} \frac{y_{i, t}}{y_t} \end{equation}

where \(n\) is the window size, \(T\) is the latest timestamp, \(y_{i, t}\) is the time series at the lower level, and \(y_t\) is the time series at the higher level; \(y_{i, t}\) and \(y_t\) are in a hierarchical relationship.
Proportions of the historical averages

This approach uses a proportion of the averages for estimation:

\begin{equation} \large p_i = \sum_{t = T - n}^{T} \frac{y_{i, t}}{n} \Bigg / \sum_{t = T - n}^{T} \frac{y_t}{n} \end{equation}

where \(n\), \(T\), \(y_{i, t}\), and \(y_t\) are defined as above.
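To make both formulas concrete, here is a toy NumPy computation of the two kinds of proportions (illustrative numbers, not the library’s implementation):

import numpy as np

y_child = np.array([2.0, 3.0, 4.0, 3.0])  # lower-level series over a window of size n = 4
y_parent = np.array([10.0, 10.0, 8.0, 12.0])  # higher-level series over the same window

p_ahp = np.mean(y_child / y_parent)  # AHP: average of the per-timestamp shares
p_pha = y_child.mean() / y_parent.mean()  # PHA: ratio of the window averages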
The described methods require only the series at the higher level for forecasting. The advantages of this approach are simplicity and reliability; its main disadvantage is the loss of information.

The top-down approach can be useful when we need to forecast lower-level series, some of which are noisy. Aggregating to a higher level and reconciling back helps to use more meaningful information while modeling.
[27]:
from etna.reconciliation import TopDownReconciliator
TopDownReconciliator accepts four arguments in its constructor: you need to specify the reconciliation levels, a method, and a window size. First, let’s look at the AHP top-down reconciliation method.
[28]:
ahp_reconciliator = TopDownReconciliator(
target_level="region_level", source_level="reason_level", method="AHP", period=6
)
The top-down approach has slightly different dataset level requirements compared to the bottom-up method. Here the source level should be higher than the target level, and the current dataset level should not be higher than the target level.

Once all level requirements are met and the reconciliator is fitted, we can obtain the mapping matrix. Note that the mapping matrix now contains reconciliation proportions, not only zeros and ones.
Columns of the top-down mapping matrix correspond to segments at the source level of the hierarchy, and rows to the segments at the target level.
[29]:
ahp_reconciliator.fit(ts=hierarchical_ts)
ahp_reconciliator.mapping_matrix.toarray()
[29]:
array([[0.29517217, 0. , 0. , 0. ],
[0.17522331, 0. , 0. , 0. ],
[0.29906179, 0. , 0. , 0. ],
[0.06509802, 0. , 0. , 0. ],
[0.10138424, 0. , 0. , 0. ],
[0.0348691 , 0. , 0. , 0. ],
[0.02919136, 0. , 0. , 0. ],
[0. , 0.35663824, 0. , 0. ],
[0. , 0.19596791, 0. , 0. ],
[0. , 0.25065754, 0. , 0. ],
[0. , 0.06313639, 0. , 0. ],
[0. , 0.09261382, 0. , 0. ],
[0. , 0.02383924, 0. , 0. ],
[0. , 0.01714686, 0. , 0. ],
[0. , 0. , 0.29766462, 0. ],
[0. , 0. , 0.16667059, 0. ],
[0. , 0. , 0.27550314, 0. ],
[0. , 0. , 0.0654707 , 0. ],
[0. , 0. , 0.13979554, 0. ],
[0. , 0. , 0.0245672 , 0. ],
[0. , 0. , 0.03032821, 0. ],
[0. , 0. , 0. , 0.29191277],
[0. , 0. , 0. , 0.15036933],
[0. , 0. , 0. , 0.25667986],
[0. , 0. , 0. , 0.09445469],
[0. , 0. , 0. , 0.1319362 ],
[0. , 0. , 0. , 0.03209989],
[0. , 0. , 0. , 0.04254726]])
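A quick shape check confirms the orientation described above: rows correspond to the 28 region-level segments and columns to the 4 reason-level segments:

ahp_reconciliator.mapping_matrix.shape  # (28, 4)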
Let’s fit a HierarchicalPipeline with the AHP method.
[30]:
reconciliator = TopDownReconciliator(target_level="region_level", source_level="reason_level", method="AHP", period=9)
pipeline = HierarchicalPipeline(
transforms=[
LagTransform(in_column="target", lags=[1, 2, 3, 4, 6, 12]),
],
model=LinearPerSegmentModel(),
reconciliator=reconciliator,
)
ahp_metrics, _, _ = pipeline.backtest(ts=hierarchical_ts, metrics=[SMAPE()], n_folds=3, aggregate_metrics=True)
ahp_metrics = ahp_metrics.set_index("segment").add_suffix("_ahp")
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.3s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.5s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.5s finished
To use another top-down proportion estimation method, pass a different method name to the TopDownReconciliator constructor. Let’s take a look at the PHA method.
[31]:
pha_reconciliator = TopDownReconciliator(
target_level="region_level", source_level="reason_level", method="PHA", period=6
)
Note that the fitted mapping matrix has the same structure as in the previous method, but with slightly different non-zero values.
[32]:
pha_reconciliator.fit(ts=hierarchical_ts)
pha_reconciliator.mapping_matrix.toarray()
[32]:
array([[0.29761574, 0. , 0. , 0. ],
[0.17910202, 0. , 0. , 0. ],
[0.29400697, 0. , 0. , 0. ],
[0.0651224 , 0. , 0. , 0. ],
[0.10000206, 0. , 0. , 0. ],
[0.03596948, 0. , 0. , 0. ],
[0.02818132, 0. , 0. , 0. ],
[0. , 0.35710317, 0. , 0. ],
[0. , 0.19744442, 0. , 0. ],
[0. , 0.24879185, 0. , 0. ],
[0. , 0.06362301, 0. , 0. ],
[0. , 0.09206311, 0. , 0. ],
[0. , 0.02404128, 0. , 0. ],
[0. , 0.01693316, 0. , 0. ],
[0. , 0. , 0.29730368, 0. ],
[0. , 0. , 0.16779538, 0. ],
[0. , 0. , 0.27544335, 0. ],
[0. , 0. , 0.06506127, 0. ],
[0. , 0. , 0.139399 , 0. ],
[0. , 0. , 0.02441176, 0. ],
[0. , 0. , 0.03058557, 0. ],
[0. , 0. , 0. , 0.28940705],
[0. , 0. , 0. , 0.14772684],
[0. , 0. , 0. , 0.26106345],
[0. , 0. , 0. , 0.09481879],
[0. , 0. , 0. , 0.13193001],
[0. , 0. , 0. , 0.03034655],
[0. , 0. , 0. , 0.04470731]])
Let’s fit a HierarchicalPipeline with the PHA method.
[33]:
reconciliator = TopDownReconciliator(target_level="region_level", source_level="reason_level", method="PHA", period=9)
pipeline = HierarchicalPipeline(
transforms=[
LagTransform(in_column="target", lags=[1, 2, 3, 4, 6, 12]),
],
model=LinearPerSegmentModel(),
reconciliator=reconciliator,
)
pha_metrics, _, _ = pipeline.backtest(ts=hierarchical_ts, metrics=[SMAPE()], n_folds=3, aggregate_metrics=True)
pha_metrics = pha_metrics.set_index("segment").add_suffix("_pha")
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.2s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.4s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.4s finished
Finally, let’s forecast the middle level series directly.
[34]:
region_level_ts = hierarchical_ts.get_level_dataset(target_level="region_level")
pipeline = Pipeline(
transforms=[
LagTransform(in_column="target", lags=[1, 2, 3, 4, 6, 12]),
],
model=LinearPerSegmentModel(),
)
region_level_metric, _, _ = pipeline.backtest(ts=region_level_ts, metrics=[SMAPE()], n_folds=3, aggregate_metrics=True)
region_level_metric = region_level_metric.set_index("segment").add_suffix("_region_level")
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.2s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.5s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.8s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.8s finished
Now we can take a look at metrics and compare methods.
[35]:
all_metrics = pd.concat([bottom_up_metrics, ahp_metrics, pha_metrics, region_level_metric], axis=1)
all_metrics
[35]:
SMAPE_bottom_up | SMAPE_ahp | SMAPE_pha | SMAPE_region_level | |
---|---|---|---|---|
segment | ||||
Bus_NSW | 5.270422 | 6.519390 | 6.318020 | 8.002023 |
Bus_NT | 25.765018 | 15.154473 | 14.734894 | 35.648559 |
Bus_QLD | 18.254162 | 3.727278 | 3.843837 | 5.920212 |
Bus_SA | 15.282322 | 18.196766 | 18.443477 | 17.586339 |
Bus_TAS | 30.695013 | 25.932555 | 25.145120 | 8.810328 |
Bus_VIC | 15.116212 | 4.755657 | 4.252078 | 10.312053 |
Bus_WA | 10.009304 | 18.514307 | 18.415316 | 10.715275 |
Hol_NSW | 14.454165 | 7.705629 | 8.011244 | 9.115648 |
Hol_NT | 53.250687 | 44.949294 | 46.821349 | 17.153756 |
Hol_QLD | 9.624166 | 8.647920 | 7.722205 | 11.364234 |
Hol_SA | 8.202269 | 20.085900 | 19.786931 | 11.244287 |
Hol_TAS | 51.592386 | 50.644414 | 51.205854 | 55.117682 |
Hol_VIC | 7.125269 | 17.980484 | 20.270132 | 21.994822 |
Hol_WA | 16.415138 | 13.303132 | 13.703019 | 25.802063 |
Oth_NSW | 29.987238 | 35.335283 | 35.113979 | 22.802959 |
Oth_NT | 98.032493 | 50.694763 | 55.755842 | 48.984850 |
Oth_QLD | 31.464303 | 26.668852 | 27.804644 | 14.136124 |
Oth_SA | 24.098806 | 41.848523 | 41.911698 | 22.057562 |
Oth_TAS | 55.187208 | 46.457792 | 44.252704 | 23.528327 |
Oth_VIC | 31.365795 | 37.310906 | 36.372753 | 25.495443 |
Oth_WA | 26.894592 | 23.561252 | 26.071981 | 25.078132 |
VFR_NSW | 4.977585 | 7.088159 | 7.067566 | 8.696804 |
VFR_NT | 46.565888 | 28.796286 | 29.001835 | 35.465418 |
VFR_QLD | 12.675037 | 4.312979 | 4.370722 | 4.169244 |
VFR_SA | 15.613376 | 19.780459 | 20.278122 | 24.620504 |
VFR_TAS | 33.182773 | 26.505685 | 29.206359 | 28.587697 |
VFR_VIC | 9.237164 | 10.549981 | 10.226061 | 21.911153 |
VFR_WA | 17.416115 | 12.329126 | 11.702146 | 3.941069 |
[36]:
all_metrics.mean()
[36]:
SMAPE_bottom_up 25.634104
SMAPE_ahp 22.405616
SMAPE_pha 22.778925
SMAPE_region_level 19.937949
dtype: float64
The results presented above show that using reconciliation methods can improve forecasting quality for some segments. In this particular case, the direct forecast for segments at the Region level is slightly better on average than the reconciliation forecasts.
4. Exogenous variables for hierarchical forecasts¶
This section shows how exogenous variables can be added to a hierarchical TSDataset.
Before adding exogenous variables to the dataset, we should decide at which level to place them. Model fitting and initial forecasting in the HierarchicalPipeline are done at the source level, so the exogenous variables should be at the source level as well.
Let’s try to add monthly indicators to our model.
[37]:
from etna.datasets.utils import duplicate_data
horizon = 3
exog_index = pd.date_range("2006-01-01", periods=periods + horizon, freq="MS")
months_df = pd.DataFrame({"timestamp": exog_index.values, "month": exog_index.month.astype("category")})
reason_level_segments = hierarchical_ts.hierarchical_structure.get_level_segments(level_name="reason_level")
[38]:
months_ts = duplicate_data(df=months_df, segments=reason_level_segments)
months_ts.head()
[38]:
segment | Bus | Hol | Oth | VFR |
---|---|---|---|---|
feature | month | month | month | month |
timestamp | ||||
2006-01-01 | 1 | 1 | 1 | 1 |
2006-02-01 | 2 | 2 | 2 | 2 |
2006-03-01 | 3 | 3 | 3 | 3 |
2006-04-01 | 4 | 4 | 4 | 4 |
2006-05-01 | 5 | 5 | 5 | 5 |
The previous block showed how to create exogenous variables and convert them to a hierarchical format manually. Another way to convert exogenous variables to a hierarchical dataset is to use TSDataset.to_hierarchical_dataset.
First, let’s convert the dataframe to hierarchical long format.
[39]:
months_ts = duplicate_data(df=months_df, segments=reason_level_segments, format="long")
months_ts.rename(columns={"segment": "reason"}, inplace=True)
months_ts.head()
[39]:
timestamp | month | reason | |
---|---|---|---|
0 | 2006-01-01 | 1 | Hol |
1 | 2006-02-01 | 2 | Hol |
2 | 2006-03-01 | 3 | Hol |
3 | 2006-04-01 | 4 | Hol |
4 | 2006-05-01 | 5 | Hol |
Now we are ready to use the to_hierarchical_dataset method. When using this method with exogenous data, pass return_hierarchy=False, because we want to use the hierarchical structure from the target variables. Setting keep_level_columns=True would add the level columns to the resulting dataframe.
[40]:
months_ts, _ = TSDataset.to_hierarchical_dataset(df=months_ts, level_columns=["reason"], return_hierarchy=False)
months_ts.head()
[40]:
segment | Bus | Hol | Oth | VFR |
---|---|---|---|---|
feature | month | month | month | month |
timestamp | ||||
2006-01-01 | 1 | 1 | 1 | 1 |
2006-02-01 | 2 | 2 | 2 | 2 |
2006-03-01 | 3 | 3 | 3 | 3 |
2006-04-01 | 4 | 4 | 4 | 4 |
2006-05-01 | 5 | 5 | 5 | 5 |
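Had we also wanted to keep the level columns as features, we could have passed keep_level_columns=True during the conversion; a sketch of that variant, where months_long_df stands for the long-format dataframe built two cells above:

months_ts_with_levels, _ = TSDataset.to_hierarchical_dataset(
    df=months_long_df, level_columns=["reason"], return_hierarchy=False, keep_level_columns=True
)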
When the dataframe with exogenous variables is prepared, create a new hierarchical dataset with them.
[41]:
hierarchical_ts_w_exog = TSDataset(
df=hierarchical_df,
df_exog=months_ts,
hierarchical_structure=hierarchical_structure,
freq="MS",
known_future="all",
)
[42]:
f"df_level={hierarchical_ts_w_exog.current_df_level}, df_exog_level={hierarchical_ts_w_exog.current_df_exog_level}"
[42]:
'df_level=city_level, df_exog_level=reason_level'
Here we can see different levels for the dataframes inside the dataset. In this case, the exogenous variables won’t be merged into the target variables.
[43]:
hierarchical_ts_w_exog.head()
[43]:
segment | Bus_NSW_city | Bus_NSW_noncity | Bus_NT_city | Bus_NT_noncity | Bus_QLD_city | Bus_QLD_noncity | Bus_SA_city | Bus_SA_noncity | Bus_TAS_city | Bus_TAS_noncity | Bus_VIC_city | Bus_VIC_noncity | Bus_WA_city | Bus_WA_noncity | Hol_NSW_city | Hol_NSW_noncity | Hol_NT_city | Hol_NT_noncity | Hol_QLD_city | Hol_QLD_noncity | Hol_SA_city | Hol_SA_noncity | Hol_TAS_city | Hol_TAS_noncity | Hol_VIC_city | Hol_VIC_noncity | Hol_WA_city | Hol_WA_noncity | Oth_NSW_city | Oth_NSW_noncity | Oth_NT_city | Oth_NT_noncity | Oth_QLD_city | Oth_QLD_noncity | Oth_SA_city | Oth_SA_noncity | Oth_TAS_city | Oth_TAS_noncity | Oth_VIC_city | Oth_VIC_noncity | Oth_WA_city | Oth_WA_noncity | VFR_NSW_city | VFR_NSW_noncity | VFR_NT_city | VFR_NT_noncity | VFR_QLD_city | VFR_QLD_noncity | VFR_SA_city | VFR_SA_noncity | VFR_TAS_city | VFR_TAS_noncity | VFR_VIC_city | VFR_VIC_noncity | VFR_WA_city | VFR_WA_noncity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
feature | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target | target |
timestamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2006-01-01 | 1201 | 1684 | 136 | 80 | 1111 | 982 | 388 | 456 | 116 | 107 | 1164 | 984 | 532 | 874 | 3096 | 14493 | 101 | 86 | 4688 | 4390 | 888 | 2201 | 619 | 1483 | 2531 | 7881 | 1383 | 2066 | 396 | 510 | 35 | 8 | 431 | 271 | 244 | 73 | 76 | 24 | 181 | 286 | 168 | 37 | 2709 | 6689 | 28 | 9 | 3003 | 2287 | 1324 | 869 | 602 | 748 | 2565 | 3428 | 1019 | 762 |
2006-02-01 | 2020 | 2281 | 138 | 170 | 776 | 1448 | 346 | 403 | 83 | 290 | 1014 | 811 | 356 | 1687 | 1479 | 9548 | 201 | 297 | 2320 | 3990 | 521 | 1414 | 409 | 689 | 1439 | 4586 | 1059 | 1395 | 657 | 581 | 69 | 39 | 669 | 170 | 142 | 221 | 36 | 61 | 229 | 323 | 170 | 99 | 2184 | 5645 | 168 | 349 | 1957 | 2945 | 806 | 639 | 257 | 266 | 1852 | 2255 | 750 | 603 |
2006-03-01 | 1975 | 2118 | 452 | 1084 | 1079 | 2300 | 390 | 360 | 196 | 107 | 1153 | 791 | 440 | 1120 | 1609 | 7301 | 619 | 745 | 4758 | 6975 | 476 | 1093 | 127 | 331 | 1488 | 3572 | 1101 | 2297 | 540 | 893 | 150 | 338 | 270 | 1164 | 397 | 315 | 32 | 23 | 128 | 318 | 380 | 1166 | 2225 | 5052 | 390 | 84 | 2619 | 2870 | 1078 | 375 | 130 | 261 | 1882 | 1929 | 953 | 734 |
2006-04-01 | 1500 | 1963 | 243 | 160 | 1128 | 1752 | 255 | 635 | 70 | 228 | 1245 | 508 | 539 | 1252 | 1520 | 9138 | 164 | 250 | 3328 | 4781 | 571 | 1699 | 371 | 949 | 1906 | 3575 | 1128 | 2433 | 745 | 1157 | 172 | 453 | 214 | 535 | 194 | 260 | 48 | 43 | 270 | 336 | 410 | 1139 | 2918 | 5385 | 244 | 218 | 2097 | 2344 | 568 | 641 | 137 | 257 | 2208 | 2882 | 999 | 715 |
2006-05-01 | 1196 | 2151 | 194 | 189 | 1192 | 1559 | 386 | 280 | 130 | 205 | 950 | 572 | 582 | 441 | 1958 | 14194 | 62 | 151 | 4930 | 5117 | 873 | 2150 | 523 | 1590 | 2517 | 8441 | 1560 | 2727 | 426 | 558 | 15 | 47 | 458 | 557 | 147 | 33 | 77 | 60 | 265 | 293 | 162 | 28 | 3154 | 7232 | 153 | 125 | 2703 | 2933 | 887 | 798 | 347 | 437 | 2988 | 3164 | 1396 | 630 |
Exogenous data is merged into the target variables only when both dataframes are at the same level. Right now, our dataset is at a lower level than the exogenous variables, so they aren’t merged. To aggregate to higher levels, we can use the get_level_dataset method.
[44]:
hierarchical_ts_w_exog.get_level_dataset(target_level="reason_level").head()
[44]:
segment | Bus | Hol | Oth | VFR | ||||
---|---|---|---|---|---|---|---|---|
feature | month | target | month | target | month | target | month | target |
timestamp | ||||||||
2006-01-01 | 1 | 9815.0 | 1 | 45906.0 | 1 | 2740.0 | 1 | 26042.0 |
2006-02-01 | 2 | 11823.0 | 2 | 29347.0 | 2 | 3466.0 | 2 | 20676.0 |
2006-03-01 | 3 | 13565.0 | 3 | 32492.0 | 3 | 6114.0 | 3 | 20582.0 |
2006-04-01 | 4 | 11478.0 | 4 | 31813.0 | 4 | 5976.0 | 4 | 21613.0 |
2006-05-01 | 5 | 10027.0 | 5 | 46793.0 | 5 | 3126.0 | 5 | 26947.0 |
The modeling process stays the same as in the previous cases.
[45]:
region_level_ts_w_exog = hierarchical_ts_w_exog.get_level_dataset(target_level="region_level")
pipeline = HierarchicalPipeline(
transforms=[
OneHotEncoderTransform(in_column="month"),
LagTransform(in_column="target", lags=[1, 2, 3, 4, 6, 12]),
],
model=LinearPerSegmentModel(),
reconciliator=TopDownReconciliator(
target_level="region_level", source_level="reason_level", period=9, method="AHP"
),
)
metric, _, _ = pipeline.backtest(ts=region_level_ts_w_exog, metrics=[SMAPE()], n_folds=3, aggregate_metrics=True)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.2s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.5s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.8s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.8s finished