Attribute Strategy

Source: docs/architecture/attribute-strategy.md

# Attribute Strategy: SIMA Risk Assessment
 
## Overview
 
This document describes the intentional design decisions around attribute handling in the TRADE pilot pipeline, specifically why we validate only three attributes for SIMA risk assessment while preserving all attributes for future use cases.
 
## Context
 
### Catalog Data Structure
 
The catalog Excel files contain rich product data:
- **24 columns** per row (SKU, Product_Name, HS codes, Provider classifications, etc.)
- **35+ rows per SKU** (Schema Tag pattern - one attribute per row)
- **2,591 unique attribute titles** across the dataset
 
Example attributes include:
- Product identifiers: `SKU`, `Product_Name`, `Class_Name`
- Classification: `Wayfair HS codes`, `Consensus HS Code`, `Top Pick Code`
- Materials: `Material`, `Primary Material`, `Sink Material`, `Frame Material`, etc.
- Origin: `Country of Origin`, `Country of Origin - Additional Details`
- Compliance: `CPSC Compliant`, `ASTM Compliant`, `CAL TB 117-2013 Compliant`
- Physical: `Weight`, `Overall Height`, `Overall Width`, `Overall Depth`
- And ~2,580 more product-specific attributes
 
### SIMA Decision Surface
 
SIMA (Special Import Measures Act) risk classification requires exactly **three inputs**:
 
| Attribute | Purpose | Example |
|-----------|---------|---------|
| **HS Code** | Determines which SIMA measure applies | `7324100010` → Stainless Steel Sinks |
| **Country of Origin** | Determines if product is from a subject country | `CN` → Subject, `CA` → Exempt |
| **Material** | Confirms product type matches measure definition | `stainless steel` → SSS measure |
 
No other attributes affect SIMA risk classification. Adding `Weight`, `CPSC Compliant`, or `Dimensions` would not change whether a product is AT_RISK, CLEARED, or FALSE_POSITIVE.
 
## Design Decision
 
### Intentional Scope: 3 Attributes for SIMA
 
The trade module validation logic (`packages/modules/trade/src/validation.ts`) normalizes exactly three attributes:
 
```
country_of_origin  →  ISO 2-letter code (CN, CA, DE, etc.)
hs_code            →  10-digit normalized code (no dots)
material           →  Categorized type (stainless_steel, aluminum, etc.)
```
 
**This is a feature, not a limitation.**
 
The 4,500 attribute validations (3 × 1,500 SKUs) represent **100% coverage of SIMA-critical fields**.
 
### Rationale
 
1. **Accuracy over breadth** - Validating irrelevant attributes adds noise without improving SIMA classification
2. **Performance** - Slim payloads when passing only validated attributes
3. **Stability** - SIMA logic is decoupled from unrelated attribute changes
4. **Auditability** - Clear linkage between validated inputs and risk outputs
 
## Architecture
 
### Current Data Flow
 
```
┌─────────────────────────────────────────────────────────────────────┐
│  CATALOG EXCEL                                                      │
│  2,591 unique attributes × 1,500 SKUs                               │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│  APPS/API (CATALOG UPLOAD)                                          │
│  • Parses all attributes from Excel                                 │
│  • Writes full attribute set to app.sku_attributes (JSONB)          │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│  JOB PROCESSOR (apps/api)                                           │
│  • Validates 3 SIMA attributes (packages/modules/trade/validation)  │
│  • Writes: app.attribute_validations                                │
│  • Computes outcomes (packages/modules/trade/sima)                  │
│  • Writes: app.sima_outcomes (all SKUs)                             │
│  • Writes: app.sima_exposures (measure-covered subset)              │
│  • Writes: app.sku_ingestions (payload/input hashes + outcome)      │
│  • Writes: app.ingestion_batches (file/job metadata)                │
└─────────────────────────────────────────────────────────────────────┘
```
 
### Database Schema
 
#### attribute_validations (SIMA-focused)
Stores validated/normalized values for the 3 SIMA-critical attributes:
```sql
CREATE TABLE app.attribute_validations (
  id UUID PRIMARY KEY,
  tenant_id UUID NOT NULL,
  sku VARCHAR(100) NOT NULL,
  attribute_name VARCHAR(100) NOT NULL,  -- country_of_origin | hs_code | material
  derived_value TEXT,                     -- Normalized value (CN, 7324100010, stainless_steel)
  source_raw TEXT,                        -- Original value from Excel
  confidence DECIMAL(3,2),                -- 0.00-1.00
  validation_rule_id VARCHAR(100),
  validated_at TIMESTAMP,
  UNIQUE(tenant_id, sku, attribute_name)
);
```
 
#### sku_attributes (Full attribute store)
Stores all parsed attributes for future use cases:
```sql
CREATE TABLE app.sku_attributes (
  id UUID PRIMARY KEY,
  tenant_id UUID NOT NULL,
  sku VARCHAR(100) NOT NULL,
  source VARCHAR(50) NOT NULL,            -- 'catalog', 'shopify', etc.
  parent_sku VARCHAR(100),                -- Optional parent SKU (true components)
  attributes JSONB NOT NULL,              -- Full attribute map
  metadata JSONB,                         -- Source file, row count, etc.
  ingested_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP NOT NULL DEFAULT NOW(),  -- Auto-updated via trigger
  UNIQUE(tenant_id, sku, source)
);
 
-- Index for JSONB queries
CREATE INDEX sku_attributes_attributes_gin ON app.sku_attributes USING GIN (attributes);
-- Partial index: only indexes rows with non-null parent_sku (more efficient)
CREATE INDEX sku_attributes_tenant_id_source_parent_sku_idx ON app.sku_attributes (tenant_id, source, parent_sku) WHERE parent_sku IS NOT NULL;
 
-- Trigger to auto-update updated_at on row modification
CREATE TRIGGER update_sku_attributes_updated_at
BEFORE UPDATE ON app.sku_attributes
FOR EACH ROW EXECUTE FUNCTION app.update_updated_at_column();
```
 
#### sku_ingestions (Request history + duplicate detection)
Stores one record per SKU request to detect exact duplicates and track outcomes:
```sql
CREATE TABLE app.sku_ingestions (
  id UUID PRIMARY KEY,
  tenant_id UUID NOT NULL,
  sku VARCHAR(100) NOT NULL,
  source VARCHAR(50) NOT NULL,
  payload JSONB NOT NULL,
  payload_hash VARCHAR(64) NOT NULL,
  inputs JSONB NOT NULL,
  inputs_hash VARCHAR(64) NOT NULL,
  outcome_status VARCHAR(20) NOT NULL,
  sima_config_version VARCHAR(20),
  created_at TIMESTAMP DEFAULT NOW()
);
```
 
#### ingestion_batches (File/job-level tracking)
Stores upload-level metadata and file hashes to detect duplicate submissions:
```sql
CREATE TABLE app.ingestion_batches (
  id UUID PRIMARY KEY,
  tenant_id UUID NOT NULL,
  source VARCHAR(50) NOT NULL,
  file_hash VARCHAR(64),
  total_skus INT,
  exact_duplicate_skus INT,
  created_at TIMESTAMP DEFAULT NOW()
);
```
 
## Future Use Cases
 
The `sku_attributes` table enables future services without modifying SIMA logic:
 
| Use Case | Attributes Needed |
|----------|-------------------|
| **Landed Cost Calculation** | Weight, Dimensions, Declared Value |
| **Safety Compliance** | CPSC, ASTM, CAL TB certifications |
| **Duty Drawback** | Commercial Warranty, Assembly Required |
| **Sustainability Reporting** | Eco-Friendly, Recycled Content |
| **Product Classification** | All materials, specifications |
 
### Expansion Pattern
 
To add a new compliance check:
 
1. **Query `sku_attributes`** for needed fields
2. **Create a new validator module** (e.g., `packages/modules/safety`)
3. **Wire it into apps/api job processing or routes**
4. **Write to a new table** (e.g., `safety_validations`)
 
SIMA logic remains unchanged.
 
## In-Process Contract
 
Catalog uploads are processed in-process by `apps/api`. The job input stores a file path and tenant ID; the parser reads the Excel file directly and writes results to `app.sku_attributes`, `app.attribute_validations`, and `app.sima_outcomes`. Each upload also records hash-based ingestion history in `app.sku_ingestions` and `app.ingestion_batches` to flag exact duplicates.
 
## Summary
 
| Component | Attributes Handled | Storage |
|-----------|-------------------|---------|
| apps/api job processor | All 2,591 (parsed) | app.sku_attributes (JSONB) |
| apps/api validation step | 3 (country, material, HS) | app.attribute_validations |
| apps/api SIMA outcomes | Outcome per SKU | app.sima_outcomes |
| apps/api ingestion tracking | Payload + inputs hashes | app.sku_ingestions |
 
This design ensures:
- **SIMA accuracy** - Only relevant attributes affect classification
- **Future flexibility** - All attributes preserved for expansion
- **Performance** - In-process validation with minimal overhead
- **Maintainability** - Clear separation of concerns