Commit 6a0f301

Create DATA_DESCRIPTOR.md
1 parent 97c310b commit 6a0f301

1 file changed

+140
-0
lines changed

docs/DATA_DESCRIPTOR.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# ARGOS Synthetic Hotel Optimization Datasets: Data Descriptor

## Background & Summary

The ARGOS (Adaptive Recursive Gradient Optimization System) framework integrates
Lexicographic Constraint Optimization (LCO) with a Componentwise Approximated
Gradient (CAG) filter to enable stable, lexicographically safe optimization in
hierarchical decision-making problems. To support reproducibility and to
facilitate independent evaluation of ARGOS, we release a suite of fully
synthetic datasets that emulate hotel and multi-unit management scenarios under
varied operating conditions.

The datasets capture key dimensions of hotel operations, including occupancy,
staffing levels, staff fatigue, and revenue per available room (RevPAR), as well
as higher-level constructs such as scenario volatility, resource constraints, and
multi-property traffic patterns. No real hotel operational data or personal data
are used: all records are synthesized from a controlled stochastic simulator.

These datasets are intended as a reproducible testbed for:

- Lexicographic optimization under strict Tier-1 “feasibility” constraints,
- Evaluation of componentwise gradient filtering (CAG),
- Benchmarking of ARGOS against baseline optimization methods, and
- Future extensions to QUBO/quantum-hybrid hotel management formulations.
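As a rough illustration of what lexicographic ordering under a strict Tier-1 constraint means, the sketch below ranks candidate outcomes tier by tier, so a lower tier only breaks ties left by the tiers above it. The `Outcome` class, the choice of revenue as the second tier, and the sample values are assumptions made for this example; it is not the ARGOS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical summary of one candidate; field names are illustrative."""
    name: str
    tier1_violations: int  # Tier-1 feasibility violations (lower is better)
    avg_revpar: float      # assumed second tier: revenue (higher is better)
    fatigue_mean: float    # staff well-being proxy (lower is better)


def lexicographic_key(o: Outcome) -> tuple:
    # Tuples compare element by element, so a later entry only matters
    # when all earlier entries are tied. Revenue is negated so that a
    # smaller key always means a better outcome.
    return (o.tier1_violations, -o.avg_revpar, o.fatigue_mean)


def best(outcomes):
    """Return the lexicographically best outcome."""
    return min(outcomes, key=lexicographic_key)


# Made-up candidates, not values from the released datasets.
candidates = [
    Outcome("a", tier1_violations=2, avg_revpar=120.0, fatigue_mean=0.30),
    Outcome("b", tier1_violations=0, avg_revpar=95.0, fatigue_mean=0.45),
    Outcome("c", tier1_violations=0, avg_revpar=95.0, fatigue_mean=0.40),
]
winner = best(candidates)
```

Candidate `a` has the highest revenue but loses because it violates Tier-1; `b` and `c` tie on the first two tiers, so the fatigue value decides between them.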
## Methods

### Simulation Framework

All datasets are generated using a stylized hotel environment, modeled loosely
on a constrained Markov decision process (CMDP) and implemented within the
ARGOS codebase. The environment describes a single hotel (or a collection of
hotels) through a state vector including normalized occupancy, staff level,
staff fatigue index, and pricing/revenue variables.

At each simulated time step, the environment evolves according to:

- deterministic dynamics capturing baseline demand and staffing trends,
- stochastic noise terms representing unmodeled variability,
- scenario-specific modifications (e.g., increased volatility or reduced staff).

ARGOS and baseline controllers produce candidate actions (price adjustments,
staffing decisions, or control signals), which are mapped to state transitions.
For the released datasets, the underlying policy is fixed and the primary focus
is on the resulting trajectories rather than policy optimization itself.
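The three ingredients of a time step (deterministic trend, stochastic noise, scenario modifier) can be sketched as follows. This is a minimal toy under stated assumptions: the update equations, the baseline of 0.7, the coefficients, and the `noise_scale` and `staff_cap` parameters are all invented for illustration and do not reproduce the ARGOS simulator. RevPAR is left out to keep the sketch short.

```python
import random


def clip01(x: float) -> float:
    """Clamp a value to the unit interval."""
    return max(0.0, min(1.0, x))


def step(state: dict, rng: random.Random, noise_scale: float = 0.02,
         staff_cap: float = 1.0) -> dict:
    """One illustrative day; the dynamics are placeholders."""
    # Deterministic part: occupancy drifts toward an assumed baseline of 0.7.
    occupancy = state["occupancy"] + 0.1 * (0.7 - state["occupancy"])
    # Stochastic part: Gaussian noise stands in for unmodeled variability.
    occupancy += rng.gauss(0.0, noise_scale)
    # Scenario modifier: a cap below 1.0 mimics a staff shortage.
    staff_level = min(state["staff_level"], staff_cap)
    # Assumed coupling: fatigue rises when occupancy outpaces staffing.
    fatigue = state["fatigue"] + 0.05 * (occupancy - staff_level)
    return {
        "occupancy": clip01(occupancy),
        "staff_level": clip01(staff_level),
        "fatigue": clip01(fatigue),
    }


def simulate(days: int, seed: int, **scenario) -> list:
    """Roll the toy environment forward under a fixed seed."""
    rng = random.Random(seed)
    state = {"occupancy": 0.6, "staff_level": 0.8, "fatigue": 0.2}
    trajectory = []
    for day in range(days):
        state = step(state, rng, **scenario)
        trajectory.append({"day": day, **state})
    return trajectory


baseline = simulate(days=30, seed=0)
shortage = simulate(days=30, seed=0, staff_cap=0.5)
```

Passing a larger `noise_scale` would play the role of a high-volatility scenario in this toy. Because the generator is seeded, calling `simulate` twice with the same arguments returns the same trajectory.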
### Scenarios

We provide several distinct scenario families:

- **Long-horizon baseline:** 365-day single-hotel operation under moderate
  noise, used to study stability and convergence.

- **High-volatility scenario:** 180-day simulation with amplified noise on
  occupancy, fatigue, and RevPAR to test robustness under non-stationary,
  high-variance conditions.

- **Staff-shortage scenario:** 180-day simulation where staff levels are
  systematically reduced and fatigue is typically elevated, stressing Tier-1
  feasibility (minimum staff) and Tier-3 “staff well-being” priorities.

- **Multi-unit traffic:** 100-day simulation of booking traffic across multiple
  hotel units, approximating heterogeneous demand across properties.

- **Hyperparameter sweep:** summary statistics across varying step sizes and
  CAG weighting coefficients, illustrating the sensitivity of performance and
  Tier-1 violations to hyperparameter choices.

- **QUBO example:** a small random QUBO matrix for demonstration of
  binary-optimization interfaces; no direct trajectory data is associated.
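For readers unfamiliar with the last item: a QUBO (quadratic unconstrained binary optimization) problem asks for the binary vector `x` that minimizes `x^T Q x`. The sketch below evaluates that quadratic form and finds the minimizer by exhaustive enumeration, which is practical only for small matrices. The 3×3 matrix is hypothetical and is not taken from the released file; the function names are likewise illustrative.

```python
from itertools import product


def qubo_energy(Q, x):
    """Return x^T Q x for a binary vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def brute_force_minimum(Q):
    """Enumerate all 2^n binary assignments; feasible only for small n."""
    n = len(Q)
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))


# Hypothetical upper-triangular matrix: negative diagonal terms reward
# setting a bit, positive off-diagonal terms penalize adjacent pairs.
Q = [
    [-1.0, 2.0, 0.0],
    [0.0, -1.0, 2.0],
    [0.0, 0.0, -1.0],
]
x_best = brute_force_minimum(Q)
```

For an 8×8 matrix the same enumeration visits 256 assignments, so it can serve as a reference answer when testing a binary-optimization interface.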
## Data Records

All datasets are provided as CSV files in the `data/` directory of the ARGOS
repository (and mirrored in the Zenodo deposition). The main files are:

1. `synthetic_long_horizon.csv`: 365 daily records of a single-hotel
   environment with columns: `day`, `occupancy`, `fatigue`, `staff_level`,
   `revpar`.

2. `scenario_high_volatility.csv`: 180 daily records under increased
   volatility with columns: `day`, `occupancy`, `fatigue`, `revpar`.

3. `scenario_staff_shortage.csv`: 180 daily records under staff-shortage
   stress with columns: `day`, `occupancy`, `fatigue`, `staff_level`, `revpar`.

4. `hyperparam_sweep_results.csv`: summary statistics for combinations of
   step size `alpha` and `cag_weight`, with columns: `alpha`, `cag_weight`,
   `avg_revpar`, `violations_tier1`, `fatigue_mean`.

5. `qubo_example_matrix.csv`: an 8×8 QUBO coefficient matrix, stored in
   wide form with each row representing one dimension of the binary quadratic
   form.

6. `multiunit_traffic_sim.csv`: 100-day multi-unit booking traffic simulation
   with columns: `day`, `hotel_0_traffic` through `hotel_4_traffic`.

Each file is accompanied by a data dictionary (see below), describing the
semantic meaning, type, and range of each column.
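A quick way to confirm that a downloaded file matches the column lists above is to compare its header row against the expected names. The sketch below uses only the standard library; the `EXPECTED_COLUMNS` mapping and the `missing_columns` helper are illustrative names, and the in-memory sample stands in for a real file with placeholder values.

```python
import csv
import io

# Column names as listed in this descriptor.
EXPECTED_COLUMNS = {
    "synthetic_long_horizon.csv":
        ["day", "occupancy", "fatigue", "staff_level", "revpar"],
    "scenario_high_volatility.csv":
        ["day", "occupancy", "fatigue", "revpar"],
    "scenario_staff_shortage.csv":
        ["day", "occupancy", "fatigue", "staff_level", "revpar"],
    "hyperparam_sweep_results.csv":
        ["alpha", "cag_weight", "avg_revpar", "violations_tier1",
         "fatigue_mean"],
}


def missing_columns(filename, handle):
    """Return the expected columns that are absent from the CSV header."""
    header = next(csv.reader(handle), [])
    return [c for c in EXPECTED_COLUMNS[filename] if c not in header]


# In-memory stand-in for a real file; the row values are placeholders.
sample = io.StringIO("day,occupancy,fatigue,revpar\n0,0.5,0.2,100.0\n")
problems = missing_columns("scenario_high_volatility.csv", sample)
```

With a real file, the handle would come from something like `open("data/scenario_high_volatility.csv", newline="")`; an empty result means every expected column is present.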
## Technical Validation

Because the data are synthetic, validation focuses on internal consistency and
plausibility rather than on comparison with an external ground truth.

- **Internal consistency:** the simulator enforces reasonable bounds on
  occupancy (0–1), staff levels (0–1), and fatigue indices (0–1). RevPAR values
  follow plausible distributions for mid-tier hotels but do not reproduce any
  specific operator’s financials.

- **Scenario behavior:** high-volatility scenarios show visibly increased
  variance in occupancy and revenue; staff-shortage scenarios show lower average
  staff levels and generally higher fatigue. These patterns were inspected
  visually via time-series plots and summary statistics.

- **Hyperparameter sensitivity:** the hyperparameter sweep dataset is generated
  using repeated runs with fixed seeds, ensuring stable comparisons between
  configurations while still reflecting stochastic variability within runs.

- **Reproducibility:** the Python scripts and notebooks used to generate these
  datasets are included in the ARGOS repository, enabling full regeneration
  under controlled seeds.
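The internal-consistency bounds listed above can be re-checked by anyone holding the CSV files. The following sketch flags any occupancy, staff-level, or fatigue value outside the 0–1 range; the helper name and the sample rows are placeholders written for this example, and RevPAR is skipped because no numeric bound is stated for it.

```python
import csv
import io

# Columns that this descriptor says are bounded to the unit interval.
UNIT_INTERVAL_COLUMNS = ("occupancy", "staff_level", "fatigue")


def out_of_bounds(handle):
    """Return (row_index, column, value) for each value outside [0, 1]."""
    violations = []
    for i, row in enumerate(csv.DictReader(handle)):
        for col in UNIT_INTERVAL_COLUMNS:
            if col not in row:
                continue  # e.g. the high-volatility file has no staff_level
            value = float(row[col])
            if not 0.0 <= value <= 1.0:
                violations.append((i, col, value))
    return violations


# Placeholder rows for demonstration, not values from the released files.
sample = io.StringIO(
    "day,occupancy,fatigue,staff_level,revpar\n"
    "0,0.62,0.20,0.80,101.5\n"
    "1,1.07,0.25,0.80,99.0\n"
)
violations = out_of_bounds(sample)
```

Here the second data row is deliberately out of range, so the check reports it; on the released files an empty list would be the expected result if the stated bounds hold.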
## Usage Notes

The datasets are designed for:

- Reproducing the experiments reported in the ARGOS paper,
- Extending the analysis with additional baselines or ablations,
- Serving as controlled environments for studying lexicographic / hierarchical
  optimization techniques.

Users should note that the datasets:

- Do not represent any specific real hotel or chain,
- Should not be used for financial forecasting or business decisions,
- May be adapted or extended by modifying the ARGOS simulation code and
  regenerating trajectories with different parameter choices.

When using these datasets in publications or derived work, please cite the ARGOS
paper and (optionally) the Zenodo dataset DOI.
