Cache layout#
hl-research stores all pulled data in a local Parquet cache. The cache is the only thing that touches disk.
Location#
XDG default:
- macOS / Linux: ~/.cache/hl-research/
- Override with --cache-dir PATH on any command, or set HLR_CACHE_DIR.
$HLR_CACHE_DIR/
├── meta.duckdb # DuckDB: sync_state, assets, vaults tables
└── data/
    ├── candles/asset=BTC/interval=1h/2024-01.parquet
    ├── funding/asset=BTC/2024.parquet
    ├── fills/wallet=0xabc.../2024-Q1.parquet
    └── vaults/address=0xvault.../trades.parquet
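The override rules above can be sketched as a small resolver. This is an illustration, not hl-research's actual code: the function name `resolve_cache_dir` is hypothetical, and both the precedence (flag, then `HLR_CACHE_DIR`, then the XDG default) and the use of `XDG_CACHE_HOME` are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(
    cli_cache_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the cache root. Hypothetical helper; the precedence is assumed."""
    env = os.environ if env is None else env

    # 1. An explicit --cache-dir PATH wins (assumed precedence).
    if cli_cache_dir:
        return Path(cli_cache_dir).expanduser()

    # 2. Then the HLR_CACHE_DIR environment variable.
    if env.get("HLR_CACHE_DIR"):
        return Path(env["HLR_CACHE_DIR"]).expanduser()

    # 3. Otherwise the XDG-style default, ~/.cache/hl-research/.
    #    Honoring XDG_CACHE_HOME is an assumption based on "XDG default".
    xdg_cache_home = env.get("XDG_CACHE_HOME")
    base = Path(xdg_cache_home) if xdg_cache_home else Path.home() / ".cache"
    return base / "hl-research"
```

Passing `env` explicitly keeps the function easy to test without mutating the real environment.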
Partitioning#
| Kind | Partition | Reason |
|---|---|---|
| Candles | asset → interval → month | Monthly files keep candle pulls bounded |
| Funding | asset → year | One file per year is small enough |
| Fills | wallet → quarter | Wallets vary widely in fill volume |
| Vault trades | address → flat | Vaults are individually small |
Parquet files are treated as immutable. Resyncing appends new files; an existing partition is rewritten only when the incoming data contains a timestamp that is already present in that partition.
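To make the partitioning scheme concrete, here is a minimal sketch that maps a record to its partition file. The function `partition_path` is hypothetical and not part of hl-research. Bucketing timestamps in UTC is an assumption, and the file names simply mirror the tree shown under Location.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional


def partition_path(
    kind: str, entity: str, ts_ms: int, interval: Optional[str] = None
) -> PurePosixPath:
    """Relative path of the Parquet file a row would land in (illustrative)."""
    # Assumption: partitions are bucketed by UTC calendar time.
    ts = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    root = PurePosixPath("data")

    if kind == "candles":  # asset -> interval -> month
        if interval is None:
            raise ValueError("candles need an interval")
        return root / "candles" / f"asset={entity}" / f"interval={interval}" / f"{ts:%Y-%m}.parquet"
    if kind == "funding":  # asset -> year
        return root / "funding" / f"asset={entity}" / f"{ts.year}.parquet"
    if kind == "fills":  # wallet -> quarter
        quarter = (ts.month - 1) // 3 + 1
        return root / "fills" / f"wallet={entity}" / f"{ts.year}-Q{quarter}.parquet"
    if kind == "vaults":  # address -> flat (a single file, no time bucket)
        return root / "vaults" / f"address={entity}" / "trades.parquet"
    raise ValueError(f"unknown kind: {kind}")


# 2024-01-15 00:00:00 UTC, in milliseconds.
print(partition_path("candles", "BTC", 1_705_276_800_000, "1h"))
# data/candles/asset=BTC/interval=1h/2024-01.parquet
```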
DuckDB metadata#
meta.duckdb carries three small tables:
CREATE TABLE sync_state (
    kind TEXT, entity TEXT, interval TEXT,
    since_ms BIGINT, until_ms BIGINT,
    row_count BIGINT, updated_at TIMESTAMP,
    PRIMARY KEY (kind, entity, interval)
);
CREATE TABLE assets (
    name TEXT PRIMARY KEY,
    sz_decimals INTEGER, max_leverage INTEGER, only_isolated BOOLEAN,
    fetched_at TIMESTAMP
);
CREATE TABLE vaults (
    address TEXT PRIMARY KEY,
    name TEXT, leader TEXT, tvl DOUBLE, apr DOUBLE, is_closed BOOLEAN,
    fetched_at TIMESTAMP
);
sync_state is the source of truth for "what's been pulled and through when." Every incremental pull reads it to find a resume point, then writes back after the pull completes.
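The read-then-write-back cycle described above might look roughly like the following. This sketch uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs without extra dependencies. The helper names `resume_point` and `record_pull` are hypothetical, and treating `row_count` as a running total is an assumption.

```python
import sqlite3
from datetime import datetime, timezone

# Same columns as the sync_state table above; SQLite stands in for DuckDB here.
SYNC_STATE_DDL = """
CREATE TABLE IF NOT EXISTS sync_state (
    kind TEXT, entity TEXT, interval TEXT,
    since_ms BIGINT, until_ms BIGINT,
    row_count BIGINT, updated_at TIMESTAMP,
    PRIMARY KEY (kind, entity, interval)
)
"""


def resume_point(con, kind, entity, interval, default_since_ms):
    """Return the timestamp (ms) the next incremental pull should start from."""
    row = con.execute(
        "SELECT until_ms FROM sync_state "
        "WHERE kind = ? AND entity = ? AND interval = ?",
        (kind, entity, interval),
    ).fetchone()
    return default_since_ms if row is None else row[0]


def record_pull(con, kind, entity, interval, since_ms, until_ms, new_rows):
    """Write back after a pull completes; upsert keyed on the primary key."""
    now = datetime.now(timezone.utc).isoformat()
    con.execute(
        """
        INSERT INTO sync_state VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT (kind, entity, interval) DO UPDATE SET
            until_ms = excluded.until_ms,
            row_count = row_count + excluded.row_count,  -- assumed cumulative
            updated_at = excluded.updated_at
        """,
        (kind, entity, interval, since_ms, until_ms, new_rows, now),
    )
    con.commit()


con = sqlite3.connect(":memory:")
con.execute(SYNC_STATE_DDL)
start = resume_point(con, "candles", "BTC", "1h", default_since_ms=0)
# ... pull candles from `start` onward, then record how far the pull got ...
record_pull(con, "candles", "BTC", "1h", start, 3_600_000, new_rows=1)
```

Writing back only after the pull completes means an interrupted pull resumes from the last recorded `until_ms` rather than from a partially written range.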
Reading from Python#
The Cache class is the public interface:
from hl_research.cache.store import Cache
cache = Cache() # or Cache(Path("/tmp/hlr-cache"))
candles = cache.read_candles("BTC", "1h") # pl.LazyFrame
funding = cache.read_funding("BTC") # pl.LazyFrame
fills = cache.read_fills("0xabc...") # pl.LazyFrame
vaults = cache.read_vault_summaries() # pl.DataFrame
entries = cache.list() # list[CacheEntry]
The LazyFrame return type lets you compose filters, joins, and aggregations without materializing any data until you call collect(). The data/ module exposes higher-level views that filter and join across these primitives.
Size estimates#
Order of magnitude per year of pulls:
| Data | Compressed Parquet |
|---|---|
| Candles, 1h, one asset | ~300 KB |
| Candles, 1m, one asset | ~18 MB |
| Funding, one asset | ~50 KB |
| Wallet fills, active trader | 1–10 MB |
| Vault trades, large vault | 5–50 MB |
A modest research cache covering top-20 assets at 1h candles + funding for two years runs around 50 MB total.
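For a quick back-of-envelope check, the per-year figures from the table can be multiplied out. The helper below is only a sketch. Note that the table figures alone give about 14 MB for this scenario, which is the same order of magnitude as the ~50 MB quoted above, so treat both as rough.

```python
# Per-asset, per-year sizes in KB, taken from the table above.
KB_PER_ASSET_YEAR = {
    "candles_1h": 300,
    "candles_1m": 18_000,
    "funding": 50,
}


def estimate_kb(n_assets: int, years: int, kinds: list[str]) -> int:
    """Rough cache size in KB for per-asset data kinds (order of magnitude only)."""
    return n_assets * years * sum(KB_PER_ASSET_YEAR[k] for k in kinds)


total_kb = estimate_kb(n_assets=20, years=2, kinds=["candles_1h", "funding"])
print(f"{total_kb / 1000:.0f} MB")  # 14 MB
```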
Clearing#
hlr data clear --asset BTC # all BTC data
hlr data clear --kind funding # all funding data
hlr data clear --asset BTC --kind candles # BTC candles only
Requires at least one scope flag. There is no --all; that's intentional.
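The "at least one scope flag" rule could be enforced along these lines. This is a hypothetical `argparse` sketch, not hlr's actual CLI code. Only the `--asset` and `--kind` flags come from the examples above.

```python
import argparse


def parse_clear_args(argv):
    """Parse `hlr data clear` style arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="hlr data clear")
    parser.add_argument("--asset", help="limit clearing to one asset, e.g. BTC")
    parser.add_argument("--kind", help="limit clearing to one data kind, e.g. funding")
    args = parser.parse_args(argv)

    # Deliberately no --all: refuse to run when no scope is given.
    if args.asset is None and args.kind is None:
        parser.error("pass at least one of --asset or --kind")
    return args


args = parse_clear_args(["--asset", "BTC", "--kind", "candles"])
print(args.asset, args.kind)  # BTC candles
```

Calling `parse_clear_args([])` exits with a usage error instead of clearing everything, which matches the intent of omitting `--all`.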