A point-in-time source pried from behind a wall, the self-correcting pipeline that grades it, and the hypotheses we ran, mapped to the assets they'd trade.
Almost any alt-data story sounds sound. Few survive a costed backtest. The hard part isn't finding data; it's not fooling yourself.
Any signal can be argued past a skeptic. A won argument is not evidence of edge.
Design-only logic and expected outcomes get mistaken for empirical results.
One confident-but-wrong call becomes a permanent line in the traded book.
One source cleared the gauntlet to investigate: a detector you validate for free, the tradeable edge a funded panel away. Here it is, and the capability behind the hunt.
The data sits at step 1, the front of the chain. Count dispensed units as they happen and you read the quarter before the company reports it.
NDC-level dispensed units & script counts. Free tier: CMS Medicaid SDUD + Medicare Part D, quarterly, public, ~1991 to now.
Map NDC → labeler → manufacturer → ticker via the free FDA NDC Directory: drug codes become "which public company."
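The first hop of that chain can be sketched as below. This is a minimal illustration, assuming the hyphenated three-segment NDC form (labeler-product-package), where the first segment is the labeler code. `LABELER_TO_TICKER` is a hypothetical lookup you would build from the FDA NDC Directory plus your own manufacturer-to-ticker map; its entries here are placeholders, not real assignments.

```python
from typing import Optional

# Hypothetical lookup: 5-digit labeler code -> ticker. Placeholder values only.
LABELER_TO_TICKER = {"00001": "AAA", "12345": "BBB"}


def labeler_code(ndc: str) -> str:
    """Return the zero-padded 5-digit labeler segment of a hyphenated NDC."""
    parts = ndc.strip().split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected labeler-product-package, got {ndc!r}")
    return parts[0].zfill(5)


def ticker_for_ndc(ndc: str) -> Optional[str]:
    """Map an NDC to a ticker, or None when the labeler is not a known public name."""
    return LABELER_TO_TICKER.get(labeler_code(ndc))
```

Returning `None` for an unmapped labeler keeps private and unlisted manufacturers out of the panel rather than forcing a guess.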
Free CMS = the government-insured slice, quarterly, ~160-day lag. Paid (IQVIA / Symphony) = full population, WEEKLY, ~$300k–$1M+/yr: the only path to a true lead.
What's unique isn't one feed; it's reaching feeds that exist but are walled. One toolkit clears the challenge, resolves entities, and streams a clean point-in-time panel.
WAF / Akamai / Cloudflare / login walls: solve once in a stealth browser, replay the cookie via curl, verify you got the file not the login page.
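The "file, not login page" check can be as simple as sniffing the first bytes of the response. A minimal sketch, assuming the expected download is a zip or gzip archive; `sniff_payload` is a hypothetical helper name, and the magic numbers are the standard zip and gzip signatures.

```python
def sniff_payload(payload: bytes) -> str:
    """Classify a downloaded body by its leading bytes."""
    if payload.startswith(b"PK\x03\x04"):  # zip local-file header
        return "zip"
    if payload.startswith(b"\x1f\x8b"):  # gzip magic number
        return "gzip"
    head = payload[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"  # most likely a login or challenge page
    return "unknown"
```

Anything other than `"zip"` or `"gzip"` would be treated as a failed fetch, so a replayed cookie that has expired is caught before the bytes reach the parser.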
Map free-text names (participants, owners, consignees) to tickers / CIKs / LEIs against a point-in-time list. Under-match over false-match.
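"Under-match over false-match" can be sketched as exact matching on a normalized name against a date-bounded listing, returning nothing when the match is missing or ambiguous. This is an illustrative sketch only: the suffix list, the `(name, ticker, start, end)` listing layout, and the function names are assumptions, not the toolkit's actual schema.

```python
import re
from datetime import date
from typing import Iterable, Optional, Tuple

# Assumed, non-exhaustive set of legal suffixes to strip before comparing.
_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc", "plc"}

Listing = Tuple[str, str, date, Optional[date]]  # name, ticker, start, end


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def resolve(name: str, as_of: date, listing: Iterable[Listing]) -> Optional[str]:
    """Return a ticker only when exactly one listing live on as_of matches."""
    key = normalize(name)
    if not key:
        return None
    hits = {
        ticker
        for (listed_name, ticker, start, end) in listing
        if normalize(listed_name) == key
        and start <= as_of
        and (end is None or as_of <= end)
    }
    return hits.pop() if len(hits) == 1 else None
```

No fuzzy scoring is used on purpose: a name that fails to match costs coverage, while a wrong match silently contaminates the signal.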
Build a PIT dataframe from huge zip / gz / csv with no OOM; timestamp by a public-on-creation field, never a restated one.
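One way to avoid OOM is to never hold the raw file in memory: stream rows out of the compressed file and keep only running aggregates. A minimal stdlib sketch, assuming hypothetical column names `created_at` (the public-on-creation timestamp), `ndc`, and `units`; real feeds will differ.

```python
import csv
import gzip
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_units(
    lines: Iterable[str], ts_field: str = "created_at"
) -> Dict[Tuple[str, str], float]:
    """Sum units per (timestamp, ndc) while reading one row at a time."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[(row[ts_field], row["ndc"])] += float(row["units"])
    return dict(totals)


def aggregate_gzip(path: str) -> Dict[Tuple[str, str], float]:
    """Stream a .csv.gz file from disk without decompressing it fully."""
    with gzip.open(path, "rt", newline="") as fh:
        return aggregate_units(fh)
```

Memory use scales with the number of distinct keys rather than the file size, and keying on the creation timestamp keeps later restatements from leaking into earlier dates.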
Hit a blocker? Try an alt path, an archived mirror, a narrower scope. The toolkit concedes to design-only only when the data is truly out of reach.
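The fallback order can be sketched as a list of fetchers tried in sequence, with "design-only" as the last resort. The labels and the `first_reachable` helper are hypothetical; each fetcher is assumed to return the payload bytes or raise `OSError` on a network or file failure.

```python
from typing import Callable, List, Optional, Tuple

Fetcher = Callable[[], Optional[bytes]]


def first_reachable(fetchers: List[Tuple[str, Fetcher]]) -> Tuple[str, Optional[bytes]]:
    """Try each path in order; concede to design-only only if all fail."""
    for label, fetch in fetchers:
        try:
            payload = fetch()
        except OSError:
            continue  # blocked or unreachable: move to the next path
        if payload:
            return label, payload
    return "design-only", None
```

Recording which label succeeded makes it possible to audit later whether a result came from the primary source, a mirror, or a narrowed scope.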
From a blank prompt to a graded, tradeable signal, built so that no grade can outrun its evidence.
Three chained LLM-agent workflows. The pipeline doesn't follow discovery and validation. It drives them, wrapped with self-correction and portfolio synthesis.
Re-tests past verdicts on new data; tracks its own wash-out rate and discounts fresh optimism by it.
On a data wall it retries alt paths / mirrors / narrower scope before conceding, turning design-only into executed.
False-discovery-rate (FDR) correction across the run's whole family of tests demotes investigate→monitor when a result won't survive correction.
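The demotion rule can be illustrated with the Benjamini-Hochberg step-up procedure. Which correction the pipeline actually uses is not stated, so treat BH, the `q = 0.10` level, and the function names as assumptions for the sake of the sketch.

```python
from typing import List


def bh_survivors(p_values: List[float], q: float = 0.10) -> List[bool]:
    """Benjamini-Hochberg: flag which hypotheses survive at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank  # largest rank whose p-value clears its threshold
    keep = [False] * m
    for idx in order[:cutoff_rank]:
        keep[idx] = True
    return keep


def corrected_verdicts(p_values: List[float], q: float = 0.10) -> List[str]:
    """Keep 'investigate' only for results that survive the correction."""
    return ["investigate" if ok else "monitor" for ok in bh_survivors(p_values, q)]
```

For example, `corrected_verdicts([0.001, 0.04, 0.03, 0.5], q=0.05)` keeps only the first result at investigate, because 0.03 and 0.04 miss their rank-scaled thresholds once all four tests are counted.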
Every verdict recorded, so the system can grade whether its own past calls turned out right.
A validated-log (do-not-rerun) and a targets ledger stop it re-pitching ideas or KPIs that already washed out.
No grade may outrun its evidence: an "A" needs tests that ran; design-only is capped and kept out of the math.
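A minimal sketch of that cap follows. The letter scale, the cap level of `"C"`, and the helper names are all assumptions for illustration; the source states only that design-only grades are capped and excluded from the statistics.

```python
GRADE_ORDER = ["A", "B", "C", "D", "F"]  # best to worst (assumed scale)
DESIGN_ONLY_CAP = "C"  # hypothetical cap for ideas with no executed tests


def capped_grade(proposed: str, tests_ran: int) -> str:
    """Return the proposed grade, capped when no test actually ran."""
    if tests_ran > 0:
        return proposed
    # max() by index picks the worse of the two grades.
    return max(proposed, DESIGN_ONLY_CAP, key=GRADE_ORDER.index)


def counts_in_stats(tests_ran: int) -> bool:
    """Design-only results stay out of wash-out rates and FDR families."""
    return tests_ran > 0
```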
When a round can't reach N even if every candidate passed, the next sweep launches alongside the current validation, hiding discovery latency instead of paying it serially.
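The overlap condition can be sketched as: if validated-so-far plus everything still in flight is below the target, start the next sweep now instead of waiting. This is an illustrative sketch using a thread pool; `validate`, `sweep`, and `target_n` are hypothetical stand-ins for the real workflow calls and the round's target count.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def must_prefetch(validated: int, in_flight: int, target_n: int) -> bool:
    """True when the round cannot reach target_n even if every candidate passes."""
    return validated + in_flight < target_n


def run_round(
    validate: Callable[[List[int]], List[int]],
    sweep: Callable[[], List[int]],
    validated: int,
    candidates: List[int],
    target_n: int,
) -> Tuple[List[int], List[int]]:
    """Validate the current candidates, overlapping the next sweep when needed."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        next_sweep = None
        if must_prefetch(validated, len(candidates), target_n):
            next_sweep = pool.submit(sweep)  # runs alongside validation
        passed = pool.submit(validate, candidates).result()
        fresh = next_sweep.result() if next_sweep else []
    return passed, fresh
```

When the round can still reach the target on its own, no sweep is launched and no discovery work is wasted.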
Every sweep and validation retries on a transient crash, riding out the API-overload windows that used to kill a whole run.
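A common way to ride out overload windows is retry with exponential backoff. The sketch below is an assumption about shape, not the pipeline's actual code: `TransientError` is a hypothetical exception that a caller would raise for overload or timeout responses, and the attempt count and delays are placeholders.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (overload, timeout)."""


def with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with doubling delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Only the marked exception type is retried, so a genuine bug still fails fast instead of being retried into a long stall.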
A failed validation's own diagnostics (what failed, the confounds, next steps) refine the source into a better spec, re-validated and deflated for the retry.
Ten signals run end-to-end. Each one is a falsifiable bet on a specific set of names: here's the bet, the names, and the honest result.
| Signal | Hypothesis (the bet) | Specific assets | Result |
|---|---|---|---|
| HKEX CCASS custody-drift | Silent custody concentration → forward multi-week drawdown | Low-float HK GEM micro-caps (08003 / 08160 …) · SHORT | monitor |
| USPTO trademark breadth | New Nice-class breadth → next-year cross-sectional returns | 3,931 US public firms · L/S | monitor |
| US customs / bill-of-lading | Inbound manifest volume → COGS / revenue, pre-disclosure | Ocean-import-heavy single names (retail / hardgoods) | monitor |
| Dubai DLD Oqood feed | Off-plan registrations nowcast developer bookings KPI | Emaar Development (DFM: EMAARDEV) | monitor |
| Grid-OEM book-to-bill | Backlog-margin surprise → forward-margin revision drift | GE Vernova · Siemens Energy · Hitachi · Eaton | monitor |
| FCHLPM cat-model docket | Certified loss-cost revision → FL property repricing | FL P&C / reinsurers (UVE · HCI · HRTG) | monitor |
| Filings-text (Lazy Prices) | YoY 10-K/10-Q text change → underreaction drift | US public equities · L/S | monitor |
| Data-center thermal-IR | Waste-heat ramp → energization, tradeable on announce | Data-center operators / REITs · shortable names | monitor |
| USPTO PTAB IPR velocity | Offensive IPR petition → exclusivity loss, bearish lead | Single-franchise biopharma (Orange-Book issuers) | reject |
| LME cancelled-warrant flow | Cancellation surprise → forward time-spread | Base metals Cu / Al / Zn / Ni · futures | reject |
Congressional-leadership trades. Clean, point-in-time detector, but the 30–45 day STOCK-Act filing lag lets the market price the edge in before T0. The post-disclosure null holds across House, Senate, and the broad pool.
LME cancelled-warrants. The "tightness" surprise is wrong-signed and anti-predicts its own physical depletion in 5 of 6 metals. PTAB IPR. Filing returns are significantly positive, not bearish.
Box office (Comscore). Opening-weekend surprises do not move diversified-studio returns across 52 films. The story is real; the tradeable signal is not.
AirDNA short-term-rental. The lodging-REIT thesis inverts in sign; the history is retroactively restated with no as-of snapshot, so any backtest is fiction.
GitHub / OSS telemetry. 47% of the stars on the most-starred repo are fake. The signal's own input is gamed, so adoption cannot be read from it.
RF satellite (HawkEye 360): customers are gov / defense only. Earned-Wage-Access: collapses to a plain CHYM / GDOT revenue read, nothing distinct.
Across ten sound, web-verified mechanisms, zero reached funded-build. The cause was not lack of imagination but point-in-time data access: paywalled revisions, no PIT history, survivorship-purged universes, endogenous opt-outs.