
Data Foundation Strategy

Source: docs/proposals/data-foundation-strategy.md

# RGL8R Data Foundation Strategy

**Author:** Claude Code
**Date:** February 2026
**Status:** Draft for Discussion

---

## Executive Summary

The RGL8R platform's value proposition depends on comprehensive, accurate regulatory data. The current MVP has ~8 SIMA measures, 2 carriers, and a handful of HS mappings—enough to demonstrate the concept but nowhere near production-ready.

**Thesis:** RGL8R is a **regulatory data platform with compliance applications on top**. The data layer is the product; the app is the view and workflow surface.

**The hard truth:** Without a scalable data strategy, we're building a Ferrari engine with no fuel system. This proposal outlines a phased approach to building the data foundation, with options ranging from "bootstrap lean" to "enterprise-grade."

### Core Principles

- **Provenance:** every row stores `source_url`, `source_id`, `source_publication_date`, `source_effective_date`, and `retrieval_date`.
- **Temporal validity:** `effective_from` / `effective_to` on everything.
- **Determinism:** decisions are pinned to a specific **data release** (e.g., `CA-2026.02`).
- **Coverage transparency:** every dataset has an explicit **coverage tier** (ALPHA/BETA/VERIFIED).
- **Human-in-the-loop:** AI drafts, humans approve when stakes are high.

---

## 1. The Data Problem

### What We Have Today

| Data Type | Current State | Coverage |
|-----------|---------------|----------|
| SIMA measures | 8 measures hardcoded | ~15% of active CBSA measures |
| Carrier claim rules | 2 carriers (UPS, FedEx) | ~5% of parcel/freight carriers |
| HS mappings | 7 prefixes with rules | <1% of HTS schedule |
| Tariff rates | None | 0% |
| Country regulations | None | 0% |
| Sanctions/denied parties | None | 0% |
| FTA/preferential programs | None | 0% |

### What a Production System Needs

| Data Type | Scope | Update Frequency | Complexity |
|-----------|-------|------------------|------------|
| **HTS Tariff Schedule** | ~18,000 line items (10-digit) per country | Annual + interim updates | Medium |
| **AD/CVD Orders** (SIMA, US 337, etc.) | ~500 active orders globally | Monthly | High |
| **Carrier Claim Rules** | 50+ carriers × claim types × geographies | Quarterly | Medium |
| **FTA Rules of Origin** | 20+ trade agreements × product categories | Rare but complex | Very High |
| **Sanctions Lists** | OFAC, EU, UN, etc. (~10 lists) | Daily | Low (structured) |
| **Country Import Requirements** | 195 countries × product categories | Varies | Very High |
| **Duty Calculation Rules** | Ad valorem, specific, compound, TRQs | Per schedule update | High |

### The Maintenance Burden

This isn't a "build once" problem. Regulations change constantly:

- **CBSA SIMA:** New cases initiated, sunset reviews, scope rulings
- **US Tariffs:** Section 301, Section 232, exclusions, extensions
- **HTS Updates:** Annual revisions, mid-year amendments
- **Carrier Policies:** Claim windows change, new surcharge types
- **Sanctions:** Daily additions/removals

**Rough estimate:** Maintaining comprehensive trade data requires 0.5-1 FTE of ongoing curation work, or significant licensing costs.

---

## 2. Strategic Options

### Option A: License Commercial Data

**Providers:**

- **Descartes/MK Data** - HTS, duty rates, FTAs, denied party screening
- **Avalara CrossBorder** - Landed cost calculation, HS classification
- **Thomson Reuters ONESOURCE** - Global trade content
- **Integration Point (now E2open)** - Trade compliance data

**Pros:**

- Fast to market
- Professionally maintained
- Covers edge cases

**Cons:**

- $50K-500K/year depending on scope
- Dependency on vendor roadmap
- May not cover SHIP (parcel audit) use case
- Margin compression

**Verdict:** Good for enterprise tier, but kills unit economics for SMB.

### Option B: Government Source Aggregation + AI Curation

**Sources:**

- USITC (HTS schedule, AD/CVD orders)
- CBSA (SIMA measures, tariff schedule)
- CBP (rulings, CROSS database)
- Federal Register (new orders, scope rulings)
- WTO (AD/CVD notifications)

**Approach:**

1. Build ingestion pipelines from official sources
2. Use LLMs to parse unstructured documents (Federal Register notices, scope rulings)
3. Human review for high-stakes classifications
4. Version control all data changes
5. Automated monitoring for source updates

**Pros:**

- Low/no licensing cost
- Full control over data model
- Can differentiate on freshness/accuracy
- AI assistance scales curation

**Cons:**

- Significant engineering investment
- Data quality responsibility is ours
- Government sources are inconsistent
- Need domain expertise for validation

**Verdict:** Higher upfront cost, better long-term economics. Aligns with "software building software" philosophy.

### Option C: Hybrid (Recommended)

- **Phase 1:** Bootstrap with government sources + manual curation for pilot customers
- **Phase 2:** Expand coverage + release bundles + coverage tiers
- **Phase 3:** AI-assisted curation pipeline + regression testing
- **Phase 4:** Global expansion + selective licensed/partner data for gaps

---

## 3. Proposed Architecture

### 3.1 Reuse RIS Design (Archived)

We already have a strong Regulatory Ingestion Service (RIS) design in `_archived/services/regulatory-ingestion-service/README.md`. The fastest, lowest-risk path is to **revive that data model inside the monolith** as embedded modules, not reinvent it.

The RIS design already covers:

- Provenance fields (`source_url`, `source_publication_date`, `source_effective_date`, `retrieval_date`)
- Change detection (SHA-256 hashing, `change_type`, `change_summary`, `previous_version_id`)
- Version history and lineage chains
- Multi-source connectors (CBSA, TARIC, UK CDS, HTSUS)

**Implementation note:** Keep reference data in a separate schema (e.g., `ref.*`) to avoid mixing tenant data with global reference data.

**Also revive (from archived services):**

- **CRE**: 6-level rule hierarchy + inheritance + conflict resolution.
- **VAS**: dual-source validation, variance levels, immutable audit trail.
- **LCE**: landed-cost calculation model (Phase 2+).
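The change-detection pattern above (SHA-256 over content, compared against the stored hash to derive a `change_type`) can be sketched in TypeScript. This is illustrative only: the `RefRecord` shape and function names are assumptions for the sketch, not the archived RIS schema, and it assumes payloads are plain JSON-serializable objects.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape for illustration; not the actual RIS schema.
interface RefRecord {
  sourceUrl: string;
  payload: Record<string, unknown>;
}

type ChangeType = "insert" | "update" | "unchanged";

// Canonical serialization: recursively sort object keys so the same content
// always produces the same string regardless of key order.
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const obj = value as Record<string, unknown>;
  return `{${Object.keys(obj)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
    .join(",")}}`;
}

// SHA-256 over the canonical form, as in the RIS change-detection design.
function contentHash(payload: Record<string, unknown>): string {
  return createHash("sha256").update(canonicalize(payload)).digest("hex");
}

// Compare a freshly fetched record against the stored hash to decide
// whether to write a new version, and what change_type to record.
function detectChange(fetched: RefRecord, storedHash?: string): ChangeType {
  if (storedHash === undefined) return "insert";
  return contentHash(fetched.payload) === storedHash ? "unchanged" : "update";
}
```

A real implementation would also persist `change_summary` and `previous_version_id` to preserve the lineage chain; the sketch only derives the `change_type`.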
### 3.2 Data Model Principles

```
┌─────────────────────────────────────────────────────────────┐
│                    Reference Data Layer                     │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Tariff    │  │    Trade    │  │   Carrier   │          │
│  │  Schedules  │  │   Remedies  │  │    Rules    │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │     FTA     │  │  Sanctions  │  │   Country   │          │
│  │    Rules    │  │    Lists    │  │    Regs     │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
├─────────────────────────────────────────────────────────────┤
│                     Versioning & Audit                      │
│  - Provenance + effective_from/effective_to                 │
│  - Release bundles (CA-2026.02) + checksums                 │
│  - Full audit trail of changes                              │
│  - Point-in-time queries ("what was the rate on X date?")   │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      Curation Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Source    │  │     AI      │  │    Human    │          │
│  │  Ingestion  │──│ Extraction  │──│   Review    │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                             │
│  Government        Parse docs        QA + approve           │
│  APIs/feeds        Extract rules     Handle edge cases      │
│  Web scraping      Flag changes      Domain expertise       │
└─────────────────────────────────────────────────────────────┘
```

### Schema Design (Conceptual)

Each reference table includes the standard RIS provenance fields (source URLs, publication/effective dates, retrieval dates, hashes, change summaries). Those are omitted below for brevity.

```prisma
enum CoverageTier {
  ALPHA
  BETA
  VERIFIED
}

model DataRelease {
  id            String    @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  releaseCode   String    // "CA-2026.02"
  jurisdiction  String
  effectiveFrom DateTime
  effectiveTo   DateTime?
  releaseNotes  String?
  checksums     Json      // Per-table checksums

  @@unique([releaseCode])
}

// Core reference tables - NOT tenant-specific
model TariffSchedule {
  id            String       @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  country       String       // ISO 3166-1 alpha-2
  hsCode        String       // 10-digit
  description   String
  mfnRate       Decimal?     // Most Favored Nation rate
  rateType      String       // ad_valorem, specific, compound
  unitOfMeasure String?
  effectiveFrom DateTime
  effectiveTo   DateTime?
  sourceId      String       // Link to DataSource
  version       Int
  coverageTier  CoverageTier @default(ALPHA)
  dataReleaseId String?

  @@unique([country, hsCode, effectiveFrom])
  @@index([country, hsCode])
}

model TradeRemedy {
  id              String       @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  country         String       // Imposing country (CA, US, EU)
  caseNumber      String       // AD/CVD case number
  caseType        String       // antidumping, countervailing, safeguard
  productScope    String       // Scope description
  originCountries String[]     // Countries subject to order
  hsCodes         String[]     // Covered HS codes
  dutyType        String       // ad_valorem, specific, combination
  dutyRate        Decimal?
  effectiveFrom   DateTime
  effectiveTo     DateTime?
  status          String       // active, suspended, revoked
  sourceId        String
  version         Int
  coverageTier    CoverageTier @default(ALPHA)
  dataReleaseId   String?
  scopeRulings    ScopeRuling[]

  @@index([country, status])
  @@index([hsCodes]) // Note: GIN index recommended for array fields, requires raw SQL migration
}

model ScopeRuling {
  id            String      @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  remedyId      String
  remedy        TradeRemedy @relation(fields: [remedyId], references: [id])
  rulingNumber  String
  rulingDate    DateTime
  productDesc   String
  determination String      // in_scope, out_of_scope, partial
  reasoning     String?
  sourceUrl     String?
}

model CarrierRule {
  id            String       @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  carrier       String
  ruleType      String       // claim_window, surcharge, service_guarantee
  geography     String?      // null = global
  parameters    Json         // Rule-specific params
  effectiveFrom DateTime
  effectiveTo   DateTime?
  sourceId      String
  version       Int
  coverageTier  CoverageTier @default(ALPHA)
  dataReleaseId String?

  @@unique([carrier, ruleType, geography, effectiveFrom])
}

model DataSource {
  id             String    @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  name           String    // "CBSA SIMA Measures in Force"
  url            String?
  sourceType     String    // government, commercial, manual
  lastFetchedAt  DateTime?
  fetchFrequency String?   // daily, weekly, monthly

  @@map("data_sources")
}

model DataChangeLog {
  id         String    @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  tableName  String
  recordId   String
  changeType String    // insert, update, delete
  oldValue   Json?
  newValue   Json?
  changedBy  String    // user_id or "system:ai_pipeline"
  changedAt  DateTime  @default(now())
  reviewedBy String?
  reviewedAt DateTime?

  @@index([tableName, recordId])
  @@index([changedAt])
}

// Regression test suite for data releases
model GoldenTestCase {
  id              String  @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  sku             String
  hsCode          String
  origin          String
  expectedOutcome String  // AT_RISK, CLEARED, NEEDS_REVIEW
  expectedMeasure String?
  notes           String?
}
```

**Determinism requirement:** every screening decision (e.g., `SIMAExposure`) should store `dataReleaseId` so outcomes are pinned to a specific release bundle.
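The temporal-validity fields make point-in-time queries ("what was the rate on X date?") a simple interval check. A minimal sketch, using an in-memory array rather than a real Prisma query; the row shape mirrors the conceptual `TariffSchedule` model above, and the half-open `[effectiveFrom, effectiveTo)` window is an assumption of this sketch:

```typescript
// Hypothetical in-memory row mirroring the conceptual TariffSchedule model;
// a production version would query the ref.* schema via Prisma instead.
interface TariffRow {
  country: string;
  hsCode: string;
  mfnRate: number;
  effectiveFrom: Date;
  effectiveTo: Date | null; // null = still in force
}

// Point-in-time lookup: the row whose [effectiveFrom, effectiveTo) window
// contains `asOf`. The @@unique([country, hsCode, effectiveFrom]) constraint
// means non-overlapping windows yield at most one match per (country, hsCode).
function rateAsOf(
  rows: TariffRow[],
  country: string,
  hsCode: string,
  asOf: Date,
): number | undefined {
  const match = rows.find(
    (r) =>
      r.country === country &&
      r.hsCode === hsCode &&
      r.effectiveFrom <= asOf &&
      (r.effectiveTo === null || asOf < r.effectiveTo),
  );
  return match?.mfnRate;
}
```

The same interval check works for `TradeRemedy` and `CarrierRule` lookups, which is why `effective_from`/`effective_to` belong on every reference table.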
### AI Curation Pipeline

```
┌──────────────────────────────────────────────────────────────┐
│                       Source Monitoring                      │
├──────────────────────────────────────────────────────────────┤
│  Daily checks:                                               │
│  - Federal Register for new AD/CVD notices                   │
│  - CBSA SIMA page for updates                                │
│  - USITC for HTS revisions                                   │
│  - Carrier websites for policy changes                       │
│                                                              │
│  Triggers alert when:                                        │
│  - New document published                                    │
│  - Existing page content changed (hash comparison)           │
│  - RSS/API update detected                                   │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                         AI Extraction                        │
├──────────────────────────────────────────────────────────────┤
│  LLM tasks:                                                  │
│  - Parse Federal Register notice → structured TradeRemedy    │
│  - Extract HS codes from scope language                      │
│  - Identify affected countries                               │
│  - Extract duty rates and formulas                           │
│  - Flag ambiguous scope language for human review            │
│                                                              │
│  Confidence scoring:                                         │
│  - High confidence (>90%): auto-approve, human spot-check    │
│  - Medium (70-90%): human review required                    │
│  - Low (<70%): flag for expert review                        │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                      Human Review Queue                      │
├──────────────────────────────────────────────────────────────┤
│  Review interface:                                           │
│  - Side-by-side: source document + extracted data            │
│  - Diff view for updates                                     │
│  - Approve / Edit / Reject actions                           │
│  - Audit trail of all decisions                              │
│                                                              │
│  Escalation:                                                 │
│  - Complex scope rulings → trade counsel review              │
│  - Rate calculations → finance review                        │
│  - New measure types → product team review                   │
└──────────────────────────────────────────────────────────────┘
```

---

## 4. Phased Roadmap

### Phase 1: Structured Foundation (Q1 2026)

**Goal:** Move hardcoded data to DB using RIS provenance model

| Task | Effort | Outcome |
|------|--------|---------|
| Adopt RIS provenance schema (`ref.*`) | 1 week | Standardized provenance + change tracking tables |
| Migrate SIMA measures to DB | 3 days | 8 measures in `trade_remedies` table |
| Migrate carrier rules to DB | 2 days | Claim windows configurable per tenant |
| Build admin API/CLI for reference data | 1 week | Internal CRUD for reference data |
| Add versioning/audit trail | 3 days | Point-in-time queries work |

**Deliverable:** Reference data is database-driven, not hardcoded. Updates don't require deploys.

### Phase 2: Expand Coverage (Q2 2026)

**Goal:** Reach North America coverage with transparent quality + release bundles

| Task | Effort | Outcome |
|------|--------|---------|
| Add coverage tiers (ALPHA/BETA/VERIFIED) | 2 days | Explicit data quality levels |
| Add release bundles + checksums | 3 days | Decisions pinned to `DataRelease` |
| Ingest full CBSA SIMA list | 1 week | All ~50 active measures |
| Ingest US AD/CVD orders | 2 weeks | All ~400 active orders |
| Add 10 more carriers | 1 week | Cover 80% of parcel volume |
| Build source monitoring | 1 week | Alerts on regulatory changes |
| Manual curation workflow | 1 week | Internal tool for data updates |

**Deliverable:** Comprehensive North America trade remedy coverage. Can serve real customers.
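The confidence-based routing described in the pipeline sketch can be expressed as a small pure function. A sketch under stated assumptions: the thresholds come from the diagram (>90% auto-approve, 70-90% human review, <70% expert review) and would need tuning against real extraction accuracy; the type and function names are illustrative, not an existing API.

```typescript
type ReviewRoute = "auto_approve" | "human_review" | "expert_review";

// Route an AI extraction to the appropriate queue based on its confidence
// score (0..1). Thresholds mirror the pipeline sketch; auto-approved items
// would still receive periodic human spot-checks.
function routeExtraction(confidence: number): ReviewRoute {
  if (confidence > 0.9) return "auto_approve"; // high: auto-approve, spot-check
  if (confidence >= 0.7) return "human_review"; // medium: reviewer must sign off
  return "expert_review"; // low: escalate to domain expert
}
```

Keeping this logic in one function makes the thresholds easy to audit and adjust as Phase 3 accuracy data comes in.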
### Phase 3: AI-Assisted Curation (Q3 2026)

**Goal:** Scale data maintenance without linear headcount

| Task | Effort | Outcome |
|------|--------|---------|
| Golden test suite + regression harness | 1-2 weeks | Release validation before rollout |
| Federal Register parser | 2 weeks | Auto-extract from new notices |
| Scope ruling analyzer | 2 weeks | Parse rulings, update coverage |
| Change detection pipeline | 1 week | Monitor all sources continuously |
| Confidence scoring | 1 week | Route to human review appropriately |
| Review queue UI | 2 weeks | Non-technical staff can QA |

**Deliverable:** New regulations automatically detected and drafted. Human review for approval.

### Phase 4: Global Expansion (Q4 2026+)

**Goal:** Multi-country support

| Task | Effort | Outcome |
|------|--------|---------|
| EU trade remedies | 3 weeks | AD/CVD, safeguards |
| UK/EU tariff schedules | 2 weeks | Post-Brexit complexity |
| APAC key markets | 4 weeks | AU, JP, KR, CN |
| FTA rules of origin | 6+ weeks | USMCA, CPTPP, etc. |
| Partner data integrations | TBD | Fill gaps with licensed sources |

**Deliverable:** True global trade compliance platform.

---

## 5. Resource Requirements

### Engineering

| Phase | Duration | Engineers | Notes |
|-------|----------|-----------|-------|
| Phase 1 | 4 weeks | 1 | Schema + migration + basic admin |
| Phase 2 | 8 weeks | 1-2 | Data ingestion + manual tooling |
| Phase 3 | 8 weeks | 2 | AI pipeline + review workflows |
| Phase 4 | Ongoing | 2+ | Geographic expansion |

### Domain Expertise

| Role | When Needed | Options |
|------|-------------|---------|
| Trade compliance SME | Phase 2+ | Fractional consultant, $150-250/hr |
| Customs broker advisor | Phase 2+ | Partnership with broker |
| Legal review | Phase 3+ | Outside counsel for scope rulings |

### Data Operations

| Phase | Curation Effort | Approach |
|-------|-----------------|----------|
| Phase 1 | ~5 hrs/week | Founders do it |
| Phase 2 | ~20 hrs/week | Hire part-time ops person |
| Phase 3 | ~10 hrs/week | AI handles 60%+, human QA |
| Phase 4 | ~20 hrs/week | Dedicated data team |

---

## 6. Risk Mitigation

### Data Quality Risks

| Risk | Mitigation |
|------|------------|
| AI extracts incorrect data | Confidence thresholds, mandatory human review for low confidence |
| Stale data causes compliance failure | Source monitoring, freshness SLAs, customer notifications |
| Scope ambiguity causes wrong classification | Conservative defaults, "consult counsel" recommendations |

### Business Risks

| Risk | Mitigation |
|------|------------|
| Can't keep up with regulatory changes | Phase 3 AI pipeline, consider licensing for gap-fill |
| Competitor has better data | Focus on specific verticals first (parcel + SIMA) |
| Customer trust issues | Transparent data sourcing, audit trails, version history |

### Technical Risks

| Risk | Mitigation |
|------|------------|
| Schema changes break existing queries | Version all reference data, maintain backward compatibility |
| AI pipeline becomes maintenance burden | Modular design, easy to swap models/providers |
| Government sources become unavailable | Cache aggressively, multiple source strategies |

---

## 7. Connection to Current Work

### How This Fits with Phase 1 Foundation

The detector registry and rules engine we're building now are the **execution layer**. This proposal is about the **data layer** that feeds them. We should revive the archived RIS model as embedded modules in the monolith so provenance and versioning are built in from day one. The same applies to CRE + VAS designs: they should be adapted into the monolith instead of left in archived services.

```
Current Work (Phase 1):               This Proposal:
┌─────────────────────┐               ┌─────────────────────┐
│  Detector Registry  │◄──────────────│   Reference Data    │
│  Rules Engine       │               │ (tariffs, remedies, │
│ TenantDetectorConfig│               │   carrier rules)    │
└─────────────────────┘               └─────────────────────┘
           │                                     │
           ▼                                     ▼
┌─────────────────────┐               ┌─────────────────────┐
│   Detection Logic   │               │  Curation Pipeline  │
│       (code)        │               │    (AI + human)     │
└─────────────────────┘               └─────────────────────┘
```

### Immediate Next Steps (After PR #90)

1. **Adopt RIS provenance model in monolith** - Revive archived design in `ref.*`
2. **Migrate SIMA measures to database** - Stop hardcoding in TypeScript
3. **Add source tracking + temporal validity** - `source_*`, `effective_from/to` on reference data
4. **Build minimal admin API/CLI + versioning** - CRUD + audit trail for changes

---

## 8. Decision Points

### Near-term (Before YC Demo)

1. **Do we migrate SIMA to DB now or later?**
   - Recommendation: Now. Small effort, establishes pattern.
2. **Do we expand carrier coverage?**
   - Recommendation: Add DHL, USPS for demo breadth. 2-3 days work.

### Medium-term (Q2 2026)

3. **Build vs license tariff data?**
   - Recommendation: Start with government sources. License if customer demands exceed our curation capacity.
4. **Hire domain expert?**
   - Recommendation: Fractional consultant first. Full-time when ARR supports it.

### Long-term (Q3+ 2026)

5. **Become a data provider?**
   - If our AI curation pipeline works well, we could license data to others. Evaluate after Phase 3.

---

## 9. Success Metrics

| Metric | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
|--------|---------|---------|---------|---------|
| Trade remedies covered | 8 | 450+ | 500+ | 1000+ |
| Carriers supported | 2 | 12 | 20 | 50+ |
| Data freshness SLA | Best effort | 7 days | 48 hours | 24 hours |
| Human curation hours/week | 5 | 20 | 10 | 20 |
| AI extraction accuracy | N/A | N/A | 85%+ | 92%+ |
| Customer data quality NPS | N/A | Baseline | +10 | +20 |

---

## 10. Recommendation

**Start now with Phase 1.** The effort is small (4 weeks), and it unblocks everything else. Hardcoded data is technical debt that compounds.

**Plan for Phase 2 in Q2.** This is when data becomes a competitive advantage vs. a liability.

**Evaluate AI curation seriously.** The "software building software" approach from the Pi article applies here—use LLMs to help maintain the data that powers the compliance engine. But always with human oversight for high-stakes decisions.

**Don't over-engineer early.** The schema above is conceptual. Start with what's needed for current customers and expand.

---

## 11. Appendix: Archived Design Revival Map

Map archived service designs to the monolith so we reuse proven patterns.
| Archived Design | Monolith Location | Priority | Notes |
|---|---|---|---|
| RIS provenance model | `apps/api/src/lib/ref/` (new) | P0 - Phase 1 | Embed RIS schema + provenance in monolith |
| RIS data sources table | `ref.data_sources` | P0 - Phase 1 | Standardize `source_*` metadata |
| CRE rule hierarchy | `packages/rules-engine/` | P1 - Phase 1.5 | 6-level cascade + conflict resolution |
| CRE validation types | Detector interfaces | P1 - Phase 1.5 | Align detector contracts with CRE |
| VAS audit trail | `app.audit_events` (exists) | P1 - Enhance | Append-only + hash lineage |
| VAS variance levels | `app.finding_thresholds` (new) | P2 | Consistent green/yellow/red thresholds |
| LCE calculation model | `packages/modules/finance/` | Phase 2+ | Rate + tax calculation engine |

**Note:** These are architectural patterns, not microservices. We should revive the designs inside the monolith to keep scope manageable.

---

## Appendix: Government Data Sources

### United States

- **USITC HTS**: https://hts.usitc.gov/ (machine-readable available)
- **Federal Register**: https://www.federalregister.gov/api/v1/ (API)
- **CBP CROSS**: https://rulings.cbp.gov/ (rulings database)
- **OFAC SDN**: https://sanctionslist.ofac.treas.gov/ (daily updates)

### Canada

- **CBSA SIMA**: https://cbsa-asfc.gc.ca/sima-lmsi/mif-mev/menu-eng.html
- **Canada Tariff**: https://www.cbsa-asfc.gc.ca/trade-commerce/tariff-tarif/
- **CITT**: https://citt-tcce.gc.ca/ (appeals, scope rulings)

### Carriers (Public Documentation)

- UPS: https://www.ups.com/us/en/support/file-a-claim.page
- FedEx: https://www.fedex.com/en-us/customer-support/claims.html
- DHL: https://www.dhl.com/us-en/home/customer-service/file-a-claim.html

---

*This document is a starting point for discussion. Specific technical decisions should be validated against current customer needs and resource availability.*