Cache layout

hl-research stores all pulled data in a local Parquet cache. The cache is the only thing that touches disk.

Location

XDG default:

  • macOS / Linux: ~/.cache/hl-research/
  • Override with --cache-dir PATH on any command, or set HLR_CACHE_DIR.

The cache root is laid out as follows:

$HLR_CACHE_DIR/
├── meta.duckdb     # DuckDB: sync_state, assets, vaults tables
└── data/
    ├── candles/asset=BTC/interval=1h/2024-01.parquet
    ├── funding/asset=BTC/2024.parquet
    ├── fills/wallet=0xabc.../2024-Q1.parquet
    └── vaults/address=0xvault.../trades.parquet
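
The lookup order described above can be sketched as a small resolver. This is a minimal illustration, not hl-research's actual code: `resolve_cache_dir` is a hypothetical name, and two behaviors are assumptions: that an explicit `--cache-dir` takes precedence over `HLR_CACHE_DIR`, and that "XDG default" means `XDG_CACHE_HOME` is honored when set.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(
    cli_cache_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the cache root: explicit flag, then HLR_CACHE_DIR, then the default."""
    env = os.environ if env is None else env

    # Assumption: an explicit --cache-dir value wins over the environment.
    if cli_cache_dir:
        return Path(cli_cache_dir).expanduser()

    env_dir = env.get("HLR_CACHE_DIR")
    if env_dir:
        return Path(env_dir).expanduser()

    # Assumption: XDG_CACHE_HOME is honored when set; otherwise fall back
    # to ~/.cache, which is the default the XDG spec prescribes.
    xdg_home = env.get("XDG_CACHE_HOME")
    base = Path(xdg_home) if xdg_home else Path.home() / ".cache"
    return base / "hl-research"
```

Passing `env` explicitly keeps the function easy to test without mutating the real process environment.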

Partitioning

| Kind | Partition | Reason |
| --- | --- | --- |
| Candles | asset → interval → month | Monthly files keep candle pulls bounded |
| Funding | asset → year | One file per year is small enough |
| Fills | wallet → quarter | Wallets vary widely in fill volume |
| Vault trades | address → flat | Vaults are individually small |

Parquet files are treated as immutable: a resync adds new files rather than editing existing ones. The one exception is deduplication: a partition is rewritten only when it would otherwise contain a duplicate timestamp.
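
To make the partitioning scheme concrete, here is a sketch that maps a record to its file path relative to the cache root, following the directory tree shown earlier. `partition_path` is a hypothetical helper, and bucketing by UTC calendar time is an assumption; the source does not say which time zone the partition boundaries use.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partition_path(kind: str, entity: str, ts_ms: int, interval: str = "") -> PurePosixPath:
    """Map a record to its Parquet file, relative to the cache root."""
    # Assumption: partitions are bucketed by UTC calendar time.
    ts = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    data = PurePosixPath("data")

    if kind == "candles":  # asset -> interval -> month
        return data / "candles" / f"asset={entity}" / f"interval={interval}" / f"{ts:%Y-%m}.parquet"
    if kind == "funding":  # asset -> year
        return data / "funding" / f"asset={entity}" / f"{ts:%Y}.parquet"
    if kind == "fills":  # wallet -> quarter
        quarter = (ts.month - 1) // 3 + 1
        return data / "fills" / f"wallet={entity}" / f"{ts:%Y}-Q{quarter}.parquet"
    if kind == "vaults":  # address -> one flat file
        return data / "vaults" / f"address={entity}" / "trades.parquet"
    raise ValueError(f"unknown kind: {kind!r}")
```

For example, a 1h BTC candle timestamped 2024-01-15 00:00 UTC (`1705276800000` ms) resolves to `data/candles/asset=BTC/interval=1h/2024-01.parquet`.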

DuckDB metadata

meta.duckdb carries three small tables:

CREATE TABLE sync_state (
    kind TEXT, entity TEXT, interval TEXT,
    since_ms BIGINT, until_ms BIGINT,
    row_count BIGINT, updated_at TIMESTAMP,
    PRIMARY KEY (kind, entity, interval)
);

CREATE TABLE assets (
    name TEXT PRIMARY KEY,
    sz_decimals INTEGER, max_leverage INTEGER, only_isolated BOOLEAN,
    fetched_at TIMESTAMP
);

CREATE TABLE vaults (
    address TEXT PRIMARY KEY,
    name TEXT, leader TEXT, tvl DOUBLE, apr DOUBLE, is_closed BOOLEAN,
    fetched_at TIMESTAMP
);

sync_state is the source of truth for "what's been pulled and through when." Every incremental pull reads it to find a resume point, then writes back after the pull completes.
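
The read-then-write-back cycle might look like the following sketch. To stay runnable with only the standard library it uses `sqlite3` as a stand-in for DuckDB, so the SQL is SQLite dialect (in DuckDB the two-argument `MIN`/`MAX` would be `LEAST`/`GREATEST`). The helpers `resume_point` and `record_pull` are hypothetical names, and the merge rule (widen the time range, accumulate the row count) is an assumption about how hl-research updates the row.

```python
import sqlite3
from typing import Optional

# Same columns as the sync_state table above.
SCHEMA = """
CREATE TABLE sync_state (
    kind TEXT, entity TEXT, interval TEXT,
    since_ms BIGINT, until_ms BIGINT,
    row_count BIGINT, updated_at TIMESTAMP,
    PRIMARY KEY (kind, entity, interval)
)
"""


def resume_point(con, kind: str, entity: str, interval: str) -> Optional[int]:
    """Return the last synced timestamp, or None if nothing has been pulled."""
    row = con.execute(
        "SELECT until_ms FROM sync_state WHERE kind = ? AND entity = ? AND interval = ?",
        (kind, entity, interval),
    ).fetchone()
    return row[0] if row else None


def record_pull(con, kind, entity, interval, since_ms, until_ms, rows) -> None:
    """Write back after a pull completes (assumed merge rule, see lead-in)."""
    con.execute(
        """
        INSERT INTO sync_state VALUES (?, ?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT (kind, entity, interval) DO UPDATE SET
            since_ms = MIN(sync_state.since_ms, excluded.since_ms),
            until_ms = MAX(sync_state.until_ms, excluded.until_ms),
            row_count = sync_state.row_count + excluded.row_count,
            updated_at = excluded.updated_at
        """,
        (kind, entity, interval, since_ms, until_ms, rows),
    )


con = sqlite3.connect(":memory:")
con.execute(SCHEMA)
first = resume_point(con, "candles", "BTC", "1h")        # nothing pulled yet
record_pull(con, "candles", "BTC", "1h", 0, 3_600_000, 2)
second = resume_point(con, "candles", "BTC", "1h")       # next pull resumes here
```

Writing the row only after the pull completes means an interrupted pull leaves the old resume point in place, so the next run simply retries the same range.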

Reading from Python

The Cache class is the public interface:

from hl_research.cache.store import Cache

cache = Cache()  # or Cache(Path("/tmp/hlr-cache"))
candles = cache.read_candles("BTC", "1h")     # pl.LazyFrame
funding = cache.read_funding("BTC")           # pl.LazyFrame
fills = cache.read_fills("0xabc...")          # pl.LazyFrame
vaults = cache.read_vault_summaries()         # pl.DataFrame
entries = cache.list()                        # list[CacheEntry]

LazyFrame returns let you compose without materializing until necessary. The data/ module exposes higher-level views that filter and join across these primitives.

Size estimates

Order of magnitude per year of pulls:

| Data | Compressed Parquet |
| --- | --- |
| Candles, 1h, one asset | ~300 KB |
| Candles, 1m, one asset | ~18 MB |
| Funding, one asset | ~50 KB |
| Wallet fills, active trader | 1–10 MB |
| Vault trades, large vault | 5–50 MB |

A modest research cache covering top-20 assets at 1h candles + funding for two years runs around 50 MB total.
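
For planning a different mix of assets and intervals, the per-year figures in the table can be combined with simple arithmetic. The helper below is a hypothetical back-of-the-envelope calculator, not part of hl-research, and it inherits the table's order-of-magnitude precision.

```python
from typing import Iterable

# Per-asset, per-year sizes in KB, copied from the table above.
KB_PER_ASSET_YEAR = {
    "candles_1h": 300,
    "candles_1m": 18_000,
    "funding": 50,
}


def estimate_mb(datasets: Iterable[str], assets: int, years: int) -> float:
    """Rough cache size in MB (1 MB taken as 1000 KB) for per-asset datasets."""
    per_asset_year_kb = sum(KB_PER_ASSET_YEAR[name] for name in datasets)
    return per_asset_year_kb * assets * years / 1000


# The "modest research cache" scenario: 20 assets, 1h candles + funding, 2 years.
modest = estimate_mb(["candles_1h", "funding"], assets=20, years=2)
```

Taken alone, the table's figures give about 14 MB for that scenario, somewhat under the roughly 50 MB quoted above. Both are order-of-magnitude numbers, so treat either as a budget rather than a measurement.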

Clearing

hlr data clear --asset BTC                    # all BTC data
hlr data clear --kind funding                 # all funding data
hlr data clear --asset BTC --kind candles     # BTC candles only

Requires at least one scope flag. There is no --all; that's intentional.
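
The scoping rule can be sketched as a filter that refuses to run without a scope. This is an illustration of the behavior described above, not the CLI's actual implementation: `Entry` and `select_for_clear` are hypothetical, and combining `--asset` with `--kind` is assumed to narrow the selection (logical AND), as the third command's comment suggests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a cache entry: what kind of data, for which asset."""
    kind: str
    asset: str


def select_for_clear(
    entries: List[Entry],
    asset: Optional[str] = None,
    kind: Optional[str] = None,
) -> List[Entry]:
    """Return the entries a clear would remove; refuse an unscoped request."""
    if asset is None and kind is None:
        # Mirrors the deliberate absence of --all.
        raise ValueError("refusing to clear everything: pass an asset and/or a kind")
    return [
        e for e in entries
        if (asset is None or e.asset == asset) and (kind is None or e.kind == kind)
    ]
```

Making the unscoped call an error, rather than a shortcut for "everything", keeps a mistyped command from wiping the whole cache.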