
Data Foundation Strategy

Source: docs/proposals/data-foundation-strategy.md

# RGL8R Data Foundation Strategy

**Author:** Claude Code
**Date:** February 2026
**Status:** Draft for Discussion

---

## Executive Summary

The RGL8R platform's value proposition depends on comprehensive, accurate regulatory data. The current MVP has ~8 SIMA measures, 2 carriers, and a handful of HS mappings—enough to demonstrate the concept but nowhere near production-ready.

**Thesis:** RGL8R is a **regulatory data platform with compliance applications on top**. The data layer is the product; the app is the view and workflow surface.

**The hard truth:** Without a scalable data strategy, we're building a Ferrari engine with no fuel system. This proposal outlines a phased approach to building the data foundation, with options ranging from "bootstrap lean" to "enterprise-grade."

### Core Principles

- **Provenance:** every row stores `source_url`, `source_id`, `source_publication_date`, `source_effective_date`, and `retrieval_date`.
- **Temporal validity:** `effective_from` / `effective_to` on everything.
- **Determinism:** decisions are pinned to a specific **data release** (e.g., `CA-2026.02`).
- **Coverage transparency:** every dataset has an explicit **coverage tier** (ALPHA/BETA/VERIFIED).
- **Human-in-the-loop:** AI drafts, humans approve when stakes are high.

---

## 1. The Data Problem

### What We Have Today

| Data Type | Current State | Coverage |
|-----------|---------------|----------|
| SIMA measures | 8 measures hardcoded | ~15% of active CBSA measures |
| Carrier claim rules | 2 carriers (UPS, FedEx) | ~5% of parcel/freight carriers |
| HS mappings | 7 prefixes with rules | <1% of HTS schedule |
| Tariff rates | None | 0% |
| Country regulations | None | 0% |
| Sanctions/denied parties | None | 0% |
| FTA/preferential programs | None | 0% |

### What a Production System Needs

| Data Type | Scope | Update Frequency | Complexity |
|-----------|-------|------------------|------------|
| **HTS Tariff Schedule** | ~18,000 line items (10-digit) per country | Annual + interim updates | Medium |
| **AD/CVD Orders** (SIMA, US 337, etc.) | ~500 active orders globally | Monthly | High |
| **Carrier Claim Rules** | 50+ carriers × claim types × geographies | Quarterly | Medium |
| **FTA Rules of Origin** | 20+ trade agreements × product categories | Rare but complex | Very High |
| **Sanctions Lists** | OFAC, EU, UN, etc. (~10 lists) | Daily | Low (structured) |
| **Country Import Requirements** | 195 countries × product categories | Varies | Very High |
| **Duty Calculation Rules** | Ad valorem, specific, compound, TRQs | Per schedule update | High |

### The Maintenance Burden

This isn't a "build once" problem. Regulations change constantly:

- **CBSA SIMA:** New cases initiated, sunset reviews, scope rulings
- **US Tariffs:** Section 301, Section 232, exclusions, extensions
- **HTS Updates:** Annual revisions, mid-year amendments
- **Carrier Policies:** Claim windows change, new surcharge types
- **Sanctions:** Daily additions/removals

**Rough estimate:** Maintaining comprehensive trade data requires 0.5-1 FTE of ongoing curation work, or significant licensing costs.

---

## 2. Strategic Options

### Option A: License Commercial Data

**Providers:**

- **Descartes/MK Data** - HTS, duty rates, FTAs, denied party screening
- **Avalara CrossBorder** - Landed cost calculation, HS classification
- **Thomson Reuters ONESOURCE** - Global trade content
- **Integration Point (now E2open)** - Trade compliance data

**Pros:**

- Fast to market
- Professionally maintained
- Covers edge cases

**Cons:**

- $50K-500K/year depending on scope
- Dependency on vendor roadmap
- May not cover SHIP (parcel audit) use case
- Margin compression

**Verdict:** Good for enterprise tier, but kills unit economics for SMB.

### Option B: Government Source Aggregation + AI Curation

**Sources:**

- USITC (HTS schedule, AD/CVD orders)
- CBSA (SIMA measures, tariff schedule)
- CBP (rulings, CROSS database)
- Federal Register (new orders, scope rulings)
- WTO (AD/CVD notifications)

**Approach:**

1. Build ingestion pipelines from official sources
2. Use LLMs to parse unstructured documents (Federal Register notices, scope rulings)
3. Human review for high-stakes classifications
4. Version control all data changes
5. Automated monitoring for source updates

**Pros:**

- Low/no licensing cost
- Full control over data model
- Can differentiate on freshness/accuracy
- AI assistance scales curation

**Cons:**

- Significant engineering investment
- Data quality responsibility is ours
- Government sources are inconsistent
- Need domain expertise for validation

**Verdict:** Higher upfront cost, better long-term economics. Aligns with "software building software" philosophy.

### Option C: Hybrid (Recommended)

- **Phase 1:** Bootstrap with government sources + manual curation for pilot customers
- **Phase 2:** Expand coverage + release bundles + coverage tiers
- **Phase 3:** AI-assisted curation pipeline + regression testing
- **Phase 4:** Global expansion + selective licensed/partner data for gaps

---

## 3. Proposed Architecture

### 3.1 Reuse RIS Design (Archived)

We already have a strong Regulatory Ingestion Service (RIS) design in `_archived/services/regulatory-ingestion-service/README.md`. The fastest, lowest-risk path is to **revive that data model inside the monolith** as embedded modules, not reinvent it.

The RIS design already covers:

- Provenance fields (`source_url`, `source_publication_date`, `source_effective_date`, `retrieval_date`)
- Change detection (SHA-256 hashing, `change_type`, `change_summary`, `previous_version_id`)
- Version history and lineage chains
- Multi-source connectors (CBSA, TARIC, UK CDS, HTSUS)

**Implementation note:** Keep reference data in a separate schema (e.g., `ref.*`) to avoid mixing tenant data with global reference data.

**Also revive (from archived services):**

- **CRE**: 6-level rule hierarchy + inheritance + conflict resolution.
- **VAS**: dual-source validation, variance levels, immutable audit trail.
- **LCE**: landed-cost calculation model (Phase 2+).
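The change-detection pattern above (SHA-256 over content, compared against the stored hash to derive a `change_type`) can be sketched in TypeScript. This is illustrative only: the `RefRecord` shape and function names are assumptions for the sketch, not the archived RIS schema, and it assumes payloads are plain JSON-serializable objects.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape for illustration; not the actual RIS schema.
interface RefRecord {
  sourceUrl: string;
  payload: Record<string, unknown>;
}

type ChangeType = "insert" | "update" | "unchanged";

// Canonical serialization: recursively sort object keys so the same content
// always produces the same string regardless of key order.
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const obj = value as Record<string, unknown>;
  return `{${Object.keys(obj)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
    .join(",")}}`;
}

// SHA-256 over the canonical form, as in the RIS change-detection design.
function contentHash(payload: Record<string, unknown>): string {
  return createHash("sha256").update(canonicalize(payload)).digest("hex");
}

// Compare a freshly fetched record against the stored hash to decide
// whether to write a new version, and what change_type to record.
function detectChange(fetched: RefRecord, storedHash?: string): ChangeType {
  if (storedHash === undefined) return "insert";
  return contentHash(fetched.payload) === storedHash ? "unchanged" : "update";
}
```

A real implementation would also persist `change_summary` and `previous_version_id` to preserve the lineage chain; the sketch only derives the `change_type`.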
### 3.2 Data Model Principles

```
┌─────────────────────────────────────────────────────────────┐
│                    Reference Data Layer                     │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Tariff    │  │    Trade    │  │   Carrier   │          │
│  │  Schedules  │  │   Remedies  │  │    Rules    │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │     FTA     │  │  Sanctions  │  │   Country   │          │
│  │    Rules    │  │    Lists    │  │    Regs     │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
├─────────────────────────────────────────────────────────────┤
│                     Versioning & Audit                      │
│  - Provenance + effective_from/effective_to                 │
│  - Release bundles (CA-2026.02) + checksums                 │
│  - Full audit trail of changes                              │
│  - Point-in-time queries ("what was the rate on X date?")   │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      Curation Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │   Source    │  │     AI      │  │    Human    │          │
│  │  Ingestion  │──│ Extraction  │──│   Review    │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                             │
│  Government        Parse docs        QA + approve           │
│  APIs/feeds        Extract rules     Handle edge cases      │
│  Web scraping      Flag changes      Domain expertise       │
└─────────────────────────────────────────────────────────────┘
```

### Schema Design (Conceptual)

Each reference table includes the standard RIS provenance fields (source URLs, publication/effective dates, retrieval dates, hashes, change summaries). Those are omitted below for brevity.

```prisma
enum CoverageTier {
  ALPHA
  BETA
  VERIFIED
}

model DataRelease {
  id            String    @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  releaseCode   String    // "CA-2026.02"
  jurisdiction  String
  effectiveFrom DateTime
  effectiveTo   DateTime?
  releaseNotes  String?
  checksums     Json      // Per-table checksums

  @@unique([releaseCode])
}

// Core reference tables - NOT tenant-specific
model TariffSchedule {
  id            String       @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  country       String       // ISO 3166-1 alpha-2
  hsCode        String       // 10-digit
  description   String
  mfnRate       Decimal?     // Most Favored Nation rate
  rateType      String       // ad_valorem, specific, compound
  unitOfMeasure String?
  effectiveFrom DateTime
  effectiveTo   DateTime?
  sourceId      String       // Link to DataSource
  version       Int
  coverageTier  CoverageTier @default(ALPHA)
  dataReleaseId String?

  @@unique([country, hsCode, effectiveFrom])
  @@index([country, hsCode])
}

model TradeRemedy {
  id              String       @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  country         String       // Imposing country (CA, US, EU)
  caseNumber      String       // AD/CVD case number
  caseType        String       // antidumping, countervailing, safeguard
  productScope    String       // Scope description
  originCountries String[]     // Countries subject to order
  hsCodes         String[]     // Covered HS codes
  dutyType        String       // ad_valorem, specific, combination
  dutyRate        Decimal?
  effectiveFrom   DateTime
  effectiveTo     DateTime?
  status          String       // active, suspended, revoked
  sourceId        String
  version         Int
  coverageTier    CoverageTier @default(ALPHA)
  dataReleaseId   String?
  scopeRulings    ScopeRuling[]

  @@index([country, status])
  @@index([hsCodes]) // Note: GIN index recommended for array fields, requires raw SQL migration
}

model ScopeRuling {
  id            String      @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  remedyId      String
  remedy        TradeRemedy @relation(fields: [remedyId], references: [id])
  rulingNumber  String
  rulingDate    DateTime
  productDesc   String
  determination String      // in_scope, out_of_scope, partial
  reasoning     String?
  sourceUrl     String?
}

model CarrierRule {
  id            String       @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  carrier       String
  ruleType      String       // claim_window, surcharge, service_guarantee
  geography     String?      // null = global
  parameters    Json         // Rule-specific params
  effectiveFrom DateTime
  effectiveTo   DateTime?
  sourceId      String
  version       Int
  coverageTier  CoverageTier @default(ALPHA)
  dataReleaseId String?

  @@unique([carrier, ruleType, geography, effectiveFrom])
}

model DataSource {
  id             String    @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  name           String    // "CBSA SIMA Measures in Force"
  url            String?
  sourceType     String    // government, commercial, manual
  lastFetchedAt  DateTime?
  fetchFrequency String?   // daily, weekly, monthly

  @@map("data_sources")
}

model DataChangeLog {
  id         String    @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  tableName  String
  recordId   String
  changeType String    // insert, update, delete
  oldValue   Json?
  newValue   Json?
  changedBy  String    // user_id or "system:ai_pipeline"
  changedAt  DateTime  @default(now())
  reviewedBy String?
  reviewedAt DateTime?

  @@index([tableName, recordId])
  @@index([changedAt])
}

// Regression test suite for data releases
model GoldenTestCase {
  id              String  @id @default(dbgenerated("gen_random_uuid()")) @db.Uuid
  sku             String
  hsCode          String
  origin          String
  expectedOutcome String  // AT_RISK, CLEARED, NEEDS_REVIEW
  expectedMeasure String?
  notes           String?
}
```

**Determinism requirement:** every screening decision (e.g., `SIMAExposure`) should store `dataReleaseId` so outcomes are pinned to a specific release bundle.
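The temporal-validity fields make point-in-time queries ("what was the rate on X date?") a simple interval check. A minimal sketch, using an in-memory array rather than a real Prisma query; the row shape mirrors the conceptual `TariffSchedule` model above, and the half-open `[effectiveFrom, effectiveTo)` window is an assumption of this sketch:

```typescript
// Hypothetical in-memory row mirroring the conceptual TariffSchedule model;
// a production version would query the ref.* schema via Prisma instead.
interface TariffRow {
  country: string;
  hsCode: string;
  mfnRate: number;
  effectiveFrom: Date;
  effectiveTo: Date | null; // null = still in force
}

// Point-in-time lookup: the row whose [effectiveFrom, effectiveTo) window
// contains `asOf`. The @@unique([country, hsCode, effectiveFrom]) constraint
// means non-overlapping windows yield at most one match per (country, hsCode).
function rateAsOf(
  rows: TariffRow[],
  country: string,
  hsCode: string,
  asOf: Date,
): number | undefined {
  const match = rows.find(
    (r) =>
      r.country === country &&
      r.hsCode === hsCode &&
      r.effectiveFrom <= asOf &&
      (r.effectiveTo === null || asOf < r.effectiveTo),
  );
  return match?.mfnRate;
}
```

The same interval check works for `TradeRemedy` and `CarrierRule` lookups, which is why `effective_from`/`effective_to` belong on every reference table.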
### AI Curation Pipeline

```
┌──────────────────────────────────────────────────────────────┐
│                       Source Monitoring                      │
├──────────────────────────────────────────────────────────────┤
│  Daily checks:                                               │
│  - Federal Register for new AD/CVD notices                   │
│  - CBSA SIMA page for updates                                │
│  - USITC for HTS revisions                                   │
│  - Carrier websites for policy changes                       │
│                                                              │
│  Triggers alert when:                                        │
│  - New document published                                    │
│  - Existing page content changed (hash comparison)           │
│  - RSS/API update detected                                   │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                         AI Extraction                        │
├──────────────────────────────────────────────────────────────┤
│  LLM tasks:                                                  │
│  - Parse Federal Register notice → structured TradeRemedy    │
│  - Extract HS codes from scope language                      │
│  - Identify affected countries                               │
│  - Extract duty rates and formulas                           │
│  - Flag ambiguous scope language for human review            │
│                                                              │
│  Confidence scoring:                                         │
│  - High confidence (>90%): auto-approve, human spot-check    │
│  - Medium (70-90%): human review required                    │
│  - Low (<70%): flag for expert review                        │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                      Human Review Queue                      │
├──────────────────────────────────────────────────────────────┤
│  Review interface:                                           │
│  - Side-by-side: source document + extracted data            │
│  - Diff view for updates                                     │
│  - Approve / Edit / Reject actions                           │
│  - Audit trail of all decisions                              │
│                                                              │
│  Escalation:                                                 │
│  - Complex scope rulings → trade counsel review              │
│  - Rate calculations → finance review                        │
│  - New measure types → product team review                   │
└──────────────────────────────────────────────────────────────┘
```

---

## 4. Phased Roadmap

### Phase 1: Structured Foundation (Q1 2026)

**Goal:** Move hardcoded data to DB using RIS provenance model

| Task | Effort | Outcome |
|------|--------|---------|
| Adopt RIS provenance schema (`ref.*`) | 1 week | Standardized provenance + change tracking tables |
| Migrate SIMA measures to DB | 3 days | 8 measures in `trade_remedies` table |
| Migrate carrier rules to DB | 2 days | Claim windows configurable per tenant |
| Build admin API/CLI for reference data | 1 week | Internal CRUD for reference data |
| Add versioning/audit trail | 3 days | Point-in-time queries work |

**Deliverable:** Reference data is database-driven, not hardcoded. Updates don't require deploys.

### Phase 2: Expand Coverage (Q2 2026)

**Goal:** Reach North America coverage with transparent quality + release bundles

| Task | Effort | Outcome |
|------|--------|---------|
| Add coverage tiers (ALPHA/BETA/VERIFIED) | 2 days | Explicit data quality levels |
| Add release bundles + checksums | 3 days | Decisions pinned to `DataRelease` |
| Ingest full CBSA SIMA list | 1 week | All ~50 active measures |
| Ingest US AD/CVD orders | 2 weeks | All ~400 active orders |
| Add 10 more carriers | 1 week | Cover 80% of parcel volume |
| Build source monitoring | 1 week | Alerts on regulatory changes |
| Manual curation workflow | 1 week | Internal tool for data updates |

**Deliverable:** Comprehensive North America trade remedy coverage. Can serve real customers.
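The confidence-based routing described in the pipeline sketch can be expressed as a small pure function. A sketch under stated assumptions: the thresholds come from the diagram (>90% auto-approve, 70-90% human review, <70% expert review) and would need tuning against real extraction accuracy; the type and function names are illustrative, not an existing API.

```typescript
type ReviewRoute = "auto_approve" | "human_review" | "expert_review";

// Route an AI extraction to the appropriate queue based on its confidence
// score (0..1). Thresholds mirror the pipeline sketch; auto-approved items
// would still receive periodic human spot-checks.
function routeExtraction(confidence: number): ReviewRoute {
  if (confidence > 0.9) return "auto_approve"; // high: auto-approve, spot-check
  if (confidence >= 0.7) return "human_review"; // medium: reviewer must sign off
  return "expert_review"; // low: escalate to domain expert
}
```

Keeping this logic in one function makes the thresholds easy to audit and adjust as Phase 3 accuracy data comes in.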
### Phase 3: AI-Assisted Curation (Q3 2026)

**Goal:** Scale data maintenance without linear headcount

| Task | Effort | Outcome |
|------|--------|---------|
| Golden test suite + regression harness | 1-2 weeks | Release validation before rollout |
| Federal Register parser | 2 weeks | Auto-extract from new notices |
| Scope ruling analyzer | 2 weeks | Parse rulings, update coverage |
| Change detection pipeline | 1 week | Monitor all sources continuously |
| Confidence scoring | 1 week | Route to human review appropriately |
| Review queue UI | 2 weeks | Non-technical staff can QA |

**Deliverable:** New regulations automatically detected and drafted. Human review for approval.

### Phase 4: Global Expansion (Q4 2026+)

**Goal:** Multi-country support

| Task | Effort | Outcome |
|------|--------|---------|
| EU trade remedies | 3 weeks | AD/CVD, safeguards |
| UK/EU tariff schedules | 2 weeks | Post-Brexit complexity |
| APAC key markets | 4 weeks | AU, JP, KR, CN |
| FTA rules of origin | 6+ weeks | USMCA, CPTPP, etc. |
| Partner data integrations | TBD | Fill gaps with licensed sources |

**Deliverable:** True global trade compliance platform.

---

## 5. Resource Requirements

### Engineering

| Phase | Duration | Engineers | Notes |
|-------|----------|-----------|-------|
| Phase 1 | 4 weeks | 1 | Schema + migration + basic admin |
| Phase 2 | 8 weeks | 1-2 | Data ingestion + manual tooling |
| Phase 3 | 8 weeks | 2 | AI pipeline + review workflows |
| Phase 4 | Ongoing | 2+ | Geographic expansion |

### Domain Expertise

| Role | When Needed | Options |
|------|-------------|---------|
| Trade compliance SME | Phase 2+ | Fractional consultant, $150-250/hr |
| Customs broker advisor | Phase 2+ | Partnership with broker |
| Legal review | Phase 3+ | Outside counsel for scope rulings |

### Data Operations

| Phase | Curation Effort | Approach |
|-------|-----------------|----------|
| Phase 1 | ~5 hrs/week | Founders do it |
| Phase 2 | ~20 hrs/week | Hire part-time ops person |
| Phase 3 | ~10 hrs/week | AI handles 60%+, human QA |
| Phase 4 | ~20 hrs/week | Dedicated data team |

---

## 6. Risk Mitigation

### Data Quality Risks

| Risk | Mitigation |
|------|------------|
| AI extracts incorrect data | Confidence thresholds, mandatory human review for low confidence |
| Stale data causes compliance failure | Source monitoring, freshness SLAs, customer notifications |
| Scope ambiguity causes wrong classification | Conservative defaults, "consult counsel" recommendations |

### Business Risks

| Risk | Mitigation |
|------|------------|
| Can't keep up with regulatory changes | Phase 3 AI pipeline, consider licensing for gap-fill |
| Competitor has better data | Focus on specific verticals first (parcel + SIMA) |
| Customer trust issues | Transparent data sourcing, audit trails, version history |

### Technical Risks

| Risk | Mitigation |
|------|------------|
| Schema changes break existing queries | Version all reference data, maintain backward compatibility |
| AI pipeline becomes maintenance burden | Modular design, easy to swap models/providers |
| Government sources become unavailable | Cache aggressively, multiple source strategies |

---

## 7. Connection to Current Work

### How This Fits with Phase 1 Foundation

The detector registry and rules engine we're building now are the **execution layer**. This proposal is about the **data layer** that feeds them. We should revive the archived RIS model as embedded modules in the monolith so provenance and versioning are built in from day one. The same applies to CRE + VAS designs: they should be adapted into the monolith instead of left in archived services.

```
Current Work (Phase 1):               This Proposal:
┌─────────────────────┐               ┌─────────────────────┐
│  Detector Registry  │◄──────────────│   Reference Data    │
│  Rules Engine       │               │ (tariffs, remedies, │
│ TenantDetectorConfig│               │   carrier rules)    │
└─────────────────────┘               └─────────────────────┘
           │                                     │
           ▼                                     ▼
┌─────────────────────┐               ┌─────────────────────┐
│   Detection Logic   │               │  Curation Pipeline  │
│       (code)        │               │    (AI + human)     │
└─────────────────────┘               └─────────────────────┘
```

### Immediate Next Steps (After PR #90)

1. **Adopt RIS provenance model in monolith** - Revive archived design in `ref.*`
2. **Migrate SIMA measures to database** - Stop hardcoding in TypeScript
3. **Add source tracking + temporal validity** - `source_*`, `effective_from/to` on reference data
4. **Build minimal admin API/CLI + versioning** - CRUD + audit trail for changes

---

## 8. Decision Points

### Near-term (Before YC Demo)

1. **Do we migrate SIMA to DB now or later?**
   - Recommendation: Now. Small effort, establishes pattern.
2. **Do we expand carrier coverage?**
   - Recommendation: Add DHL, USPS for demo breadth. 2-3 days work.

### Medium-term (Q2 2026)

3. **Build vs license tariff data?**
   - Recommendation: Start with government sources. License if customer demands exceed our curation capacity.
4. **Hire domain expert?**
   - Recommendation: Fractional consultant first. Full-time when ARR supports it.

### Long-term (Q3+ 2026)

5. **Become a data provider?**
   - If our AI curation pipeline works well, we could license data to others. Evaluate after Phase 3.

---

## 9. Success Metrics

| Metric | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
|--------|---------|---------|---------|---------|
| Trade remedies covered | 8 | 450+ | 500+ | 1000+ |
| Carriers supported | 2 | 12 | 20 | 50+ |
| Data freshness SLA | Best effort | 7 days | 48 hours | 24 hours |
| Human curation hours/week | 5 | 20 | 10 | 20 |
| AI extraction accuracy | N/A | N/A | 85%+ | 92%+ |
| Customer data quality NPS | N/A | Baseline | +10 | +20 |

---

## 10. Recommendation

**Start now with Phase 1.** The effort is small (4 weeks), and it unblocks everything else. Hardcoded data is technical debt that compounds.

**Plan for Phase 2 in Q2.** This is when data becomes a competitive advantage vs. a liability.

**Evaluate AI curation seriously.** The "software building software" approach from the Pi article applies here—use LLMs to help maintain the data that powers the compliance engine. But always with human oversight for high-stakes decisions.

**Don't over-engineer early.** The schema above is conceptual. Start with what's needed for current customers and expand.

---

## 11. Appendix: Archived Design Revival Map

Map archived service designs to the monolith so we reuse proven patterns.
| Archived Design | Monolith Location | Priority | Notes |
|---|---|---|---|
| RIS provenance model | `apps/api/src/lib/ref/` (new) | P0 - Phase 1 | Embed RIS schema + provenance in monolith |
| RIS data sources table | `ref.data_sources` | P0 - Phase 1 | Standardize `source_*` metadata |
| CRE rule hierarchy | `packages/rules-engine/` | P1 - Phase 1.5 | 6-level cascade + conflict resolution |
| CRE validation types | Detector interfaces | P1 - Phase 1.5 | Align detector contracts with CRE |
| VAS audit trail | `app.audit_events` (exists) | P1 - Enhance | Append-only + hash lineage |
| VAS variance levels | `app.finding_thresholds` (new) | P2 | Consistent green/yellow/red thresholds |
| LCE calculation model | `packages/modules/finance/` | Phase 2+ | Rate + tax calculation engine |

**Note:** These are architectural patterns, not microservices. We should revive the designs inside the monolith to keep scope manageable.

---

## Appendix: Government Data Sources

### United States

- **USITC HTS**: https://hts.usitc.gov/ (machine-readable available)
- **Federal Register**: https://www.federalregister.gov/api/v1/ (API)
- **CBP CROSS**: https://rulings.cbp.gov/ (rulings database)
- **OFAC SDN**: https://sanctionslist.ofac.treas.gov/ (daily updates)

### Canada

- **CBSA SIMA**: https://cbsa-asfc.gc.ca/sima-lmsi/mif-mev/menu-eng.html
- **Canada Tariff**: https://www.cbsa-asfc.gc.ca/trade-commerce/tariff-tarif/
- **CITT**: https://citt-tcce.gc.ca/ (appeals, scope rulings)

### Carriers (Public Documentation)

- UPS: https://www.ups.com/us/en/support/file-a-claim.page
- FedEx: https://www.fedex.com/en-us/customer-support/claims.html
- DHL: https://www.dhl.com/us-en/home/customer-service/file-a-claim.html

---

*This document is a starting point for discussion. Specific technical decisions should be validated against current customer needs and resource availability.*