Module 8: Metadata Architecture

LESSON 3: DOCUMENT DATABASES AND PASSPORT REPOSITORIES

Lesson Overview

This lesson covers document databases and passport repositories for Digital Product Passport implementations. Students will learn about JSON document storage, flexible schemas, passport repositories, version management, and how to leverage document databases for DPP data that naturally fits document structure. The lesson provides practical guidance on implementing document-based storage for passport data.

Learning Objectives

  • Design effective document database schemas for DPP passports
  • Implement flexible schemas that accommodate evolution
  • Design passport repositories with version management
  • Optimize document database performance for DPP workloads
  • Implement data access patterns for document databases
  • Manage document database evolution and migration

Detailed Content

Document Database Overview

Document databases store semi-structured data in document formats (typically JSON, BSON, or XML). For DPP systems, document databases are ideal because passport data naturally fits document structure—each passport is a self-contained document with nested attributes, relationships, and metadata. Document databases provide flexible schemas, horizontal scalability, and rich query capabilities.

Document Model: The document model stores data as documents rather than rows in tables. Documents are typically JSON-like structures with fields and values. Documents can be nested (documents within documents) and can contain arrays (lists of values). This model naturally represents hierarchical data like product hierarchies and bills of materials. For DPP systems, the document model aligns with CEDM's document-oriented design.

Flexible Schemas: Document databases have flexible schemas—different documents in the same collection can have different fields. This flexibility enables schema evolution without downtime and accommodates diverse product types with different attributes. Flexibility is valuable for DPP systems where different product categories have different attributes and where schemas evolve over time.

Horizontal Scalability: Document databases scale horizontally by sharding data across multiple servers. Sharding distributes data based on shard keys, so capacity can grow close to linearly as shards are added, provided the shard key spreads data and load evenly. Horizontal scalability is essential for DPP systems that must accommodate millions of products and high query volumes. Document databases like MongoDB and Couchbase are designed for horizontal scalability.

Rich Query Capabilities: Modern document databases provide rich query capabilities. Capabilities include document queries (find documents matching criteria), aggregation pipelines (complex data processing), full-text search (text search within documents), and geospatial queries (location-based queries). These capabilities enable sophisticated analysis and search of DPP data. For DPP systems, rich query capabilities support consumer search and regulatory reporting.

Passport Document Structure

Passport documents should be designed to represent the complete passport data in a coherent, queryable structure. Effective document design enables efficient queries and supports regulatory requirements.

Document Root: The passport document root contains core passport information. Root fields include passport_id (unique identifier), product_id (identifier of the product), product_type (type of product), manufacturer (manufacturer information), creation_date (when passport was created), and last_updated (when passport was last updated). Root fields should be indexed for efficient lookup. For DPP systems, the document root aligns with CEDM core elements.

Product Section: The product section contains product-specific data. The section includes product_attributes (key-value pairs of product attributes), product_classifications (classification codes and systems), and product_specifications (technical specifications). Product attributes should use consistent naming conventions and should be queryable. For DPP systems, the product section contains the bulk of passport data.

Organization Section: The organization section contains actor information. The section includes manufacturer (manufacturer details), suppliers (list of suppliers with their roles), and other_actors (verifiers, recyclers, etc.). The organization section should use consistent identifiers, such as the GLN (Global Location Number), and should enable relationship queries. For DPP systems, the organization section enables supply chain traceability.

Evidence Section: The evidence section contains references to supporting documents. The section includes certificates (list of certificates with references), test_reports (list of test reports), and other_evidence (other supporting documents). The evidence section should store references (object storage keys) rather than full documents to keep passport size manageable. For DPP systems, the evidence section enables verification and compliance demonstration.

Metadata Section: The metadata section contains passport metadata. The section includes lifecycle_metadata (creation, updates, archival status), access_metadata (access control, permissions), and governance_metadata (ownership, stewardship). The metadata section should support lifecycle management and access control. For DPP systems, the metadata section enables governance and lifecycle management.
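The five sections described above can be pictured as one nested document. The following is a minimal sketch in Python, not a normative CEDM layout: the section keys (product, organization, evidence, metadata), the storage_key field, and every value are illustrative assumptions.

```python
import json

# Hypothetical passport document: every identifier and value is a placeholder.
passport = {
    # Document root: core fields used for lookup
    "passport_id": "dpp-0001",
    "product_id": "prod-1234",
    "product_type": "battery",
    "manufacturer": {"name": "Example Manufacturer", "gln": "0000000000000"},
    "creation_date": "2024-01-15T09:00:00Z",
    "last_updated": "2024-03-02T14:30:00Z",
    # Product section: embedded because it is read with almost every request
    "product": {
        "product_attributes": {"capacity_kwh": 75, "chemistry": "NMC"},
        "product_classifications": [{"system": "example-system", "code": "X-001"}],
        "product_specifications": {"nominal_voltage_v": 400},
    },
    # Organization section: actors identified by (placeholder) GLNs
    "organization": {
        "manufacturer": {"gln": "0000000000000", "country": "DE"},
        "suppliers": [{"gln": "0000000000001", "role": "cell supplier"}],
        "other_actors": [{"gln": "0000000000002", "role": "recycler"}],
    },
    # Evidence section: references to object storage, not the files themselves
    "evidence": {
        "certificates": [
            {"name": "cert-01", "storage_key": "evidence/dpp-0001/cert-01.pdf"}
        ],
        "test_reports": [],
        "other_evidence": [],
    },
    # Metadata section: lifecycle, access, and governance information
    "metadata": {
        "lifecycle_metadata": {"status": "active"},
        "access_metadata": {"public_sections": ["product"]},
        "governance_metadata": {"owner": "example-owner"},
    },
}

encoded = json.dumps(passport, indent=2)
print(f"{len(encoded.encode('utf-8'))} bytes as JSON")
print(sorted(passport["evidence"]))
```

Keeping only storage keys in the evidence section is what keeps the document small even when a passport accumulates many certificates.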

Flexible Schema Design

Flexible schemas enable document databases to accommodate diverse product types and evolving requirements without schema changes. Effective flexible schema design balances flexibility with queryability.

Schema Evolution Strategies: Different strategies manage schema evolution. Strategies include additive evolution (add new fields without removing old), versioned documents (include schema version in document), and backward compatibility (new schema works with old documents). Additive evolution is simplest and should be preferred. For DPP systems, additive evolution enables schema changes without breaking existing data.

Optional Fields: New fields should be added as optional to maintain backward compatibility. Optional fields allow documents with old schema to coexist with documents with new schema. Applications should handle missing fields gracefully (use default values or null checks). For DPP systems, optional fields enable gradual schema migration across the ecosystem.
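As an illustration of handling missing fields gracefully, the sketch below reads an optional attribute from documents written under an older and a newer schema. The helper get_nested and the field recycled_content_pct are hypothetical names introduced only for this example.

```python
from typing import Any

def get_nested(doc: dict, path: str, default: Any = None) -> Any:
    """Walk a dotted path and return `default` if any segment is missing."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Older document: written before the optional field existed.
old_doc = {
    "passport_id": "dpp-0001",
    "product": {"product_attributes": {"chemistry": "NMC"}},
}
# Newer document: includes the optional field.
new_doc = {
    "passport_id": "dpp-0002",
    "product": {"product_attributes": {"chemistry": "LFP", "recycled_content_pct": 12.5}},
}

for doc in (old_doc, new_doc):
    pct = get_nested(doc, "product.product_attributes.recycled_content_pct")
    label = "not declared" if pct is None else f"{pct}%"
    print(doc["passport_id"], label)
```

Both documents can live in the same collection; only the reading code needs to know that the field may be absent.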

Schema Validation: While document databases have flexible schemas, validation is still important for data quality. Validation can be implemented at application level (validate before write) or database level (using schema validation features). Validation should include type validation (data types are correct), constraint validation (values meet constraints), and business rule validation (data meets business rules). For DPP systems, schema validation is essential for data quality despite flexible schemas.
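The three kinds of validation can be sketched as a small application-level check that runs before a write. This is a simplified illustration: the allowed product types and the date rule are assumptions, and a production system would more likely validate against a full JSON Schema.

```python
from typing import List

# Hypothetical controlled vocabulary for product_type.
ALLOWED_PRODUCT_TYPES = {"battery", "textile", "electronics"}

def validate_passport(doc: dict) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []

    # Type validation: required root fields must be strings.
    for field in ("passport_id", "product_id", "product_type"):
        if not isinstance(doc.get(field), str):
            errors.append(f"{field}: expected a string")

    # Constraint validation: product_type must come from the vocabulary.
    product_type = doc.get("product_type")
    if isinstance(product_type, str) and product_type not in ALLOWED_PRODUCT_TYPES:
        errors.append(f"product_type: '{product_type}' is not an allowed value")

    # Business rule validation: a passport cannot be updated before it exists.
    # Plain string comparison works only because both values are assumed to be
    # ISO 8601 UTC timestamps in the same format.
    created, updated = doc.get("creation_date"), doc.get("last_updated")
    if isinstance(created, str) and isinstance(updated, str) and updated < created:
        errors.append("last_updated: must not be earlier than creation_date")

    return errors

good = {
    "passport_id": "dpp-0001", "product_id": "prod-1234", "product_type": "battery",
    "creation_date": "2024-01-15T09:00:00Z", "last_updated": "2024-03-02T14:30:00Z",
}
bad = {
    "passport_id": 42, "product_id": "prod-1234", "product_type": "furniture",
    "creation_date": "2024-03-02T14:30:00Z", "last_updated": "2024-01-15T09:00:00Z",
}
print(validate_passport(good))
print(validate_passport(bad))
```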

Schema Documentation: Flexible schemas require clear documentation to ensure consistent interpretation. Documentation should include field definitions (what each field means), allowed values (enumerations, controlled vocabularies), and examples (example documents). Documentation should be versioned with schema versions. For DPP systems, schema documentation is critical for ecosystem interoperability.

Version Management

Passport documents change over time—products are updated, evidence is added, regulations change. Version management enables tracking of document evolution and retrieval of historical versions.

Document Versioning: Document versioning can be implemented in different ways. Approaches include versioned documents (each version is a separate document with version ID), embedded versions (history embedded in single document), and external versioning (version tracking in separate collection). Versioned documents provide clean separation but increase storage. Embedded versions provide single-document history but increase document size. For DPP systems, versioned documents are typically preferred for clean separation and queryability.

Version Metadata: Each version should include version metadata. Metadata includes version_id (unique version identifier), version_number (sequential version number), change_description (what changed in this version), change_timestamp (when change occurred), and change_actor (who made the change). Metadata enables audit trails and change analysis. For DPP systems, version metadata is essential for regulatory compliance and audit trails.
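A possible way to attach this metadata, assuming the versioned-documents approach in which each version is its own document, is sketched below. The function new_version is a hypothetical helper, and it merges changes only at the top level of the document.

```python
import copy
import uuid
from datetime import datetime, timezone

def new_version(previous: dict, changes: dict, description: str, actor: str) -> dict:
    """Build the next version as a separate document; `previous` is not modified."""
    version = copy.deepcopy(previous)
    version.update(changes)  # shallow merge: top-level keys only
    version["version_id"] = str(uuid.uuid4())
    version["version_number"] = previous.get("version_number", 0) + 1
    version["change_description"] = description
    version["change_timestamp"] = datetime.now(timezone.utc).isoformat()
    version["change_actor"] = actor
    return version

v1 = {"passport_id": "dpp-0001", "product_type": "battery", "version_number": 1}
v2 = new_version(v1, {"product_type": "battery-pack"}, "Reclassified product", "user-17")

print(v2["version_number"], v2["change_description"], v2["change_actor"])
```

Because the previous version is copied rather than edited, the audit trail is simply the set of stored version documents.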

Current Version Tracking: Systems need to identify the current version of each passport. Approaches include current flag (boolean field indicating current version), separate current collection (separate collection with only current versions), and query-based (query for latest version by timestamp). The current flag is simple, but publishing a new version touches two documents (the flag is cleared on the old version and set on the new one), so the change must be applied atomically, for example in a multi-document transaction where the database supports one. A separate collection provides clean separation but requires synchronization. For DPP systems, current flag with atomic updates is commonly used.

Version Retrieval: Applications need to retrieve specific versions of passports. Retrieval should support retrieving current version (most common use case), retrieving specific version (by version ID), and retrieving version history (all versions for a passport). Retrieval should be efficient and should support pagination for version history. For DPP systems, version retrieval is essential for audit trails and historical analysis.
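The three retrieval modes, together with a current flag, can be sketched with an in-memory stand-in for a versions collection. VersionStore and the is_current field are hypothetical names; a real implementation would issue the equivalent database queries and would need the flag change to be atomic.

```python
from typing import Dict, List, Optional

class VersionStore:
    """In-memory stand-in for a collection of versioned passport documents."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[dict]] = {}

    def append(self, passport_id: str, body: dict) -> dict:
        history = self._versions.setdefault(passport_id, [])
        # In a real database, clearing the old flag and inserting the new
        # version would have to happen atomically.
        for old in history:
            old["is_current"] = False
        doc = {**body, "passport_id": passport_id,
               "version_number": len(history) + 1, "is_current": True}
        history.append(doc)
        return doc

    def current(self, passport_id: str) -> Optional[dict]:
        versions = self._versions.get(passport_id, [])
        return next((v for v in versions if v["is_current"]), None)

    def by_number(self, passport_id: str, version_number: int) -> Optional[dict]:
        versions = self._versions.get(passport_id, [])
        return next((v for v in versions if v["version_number"] == version_number), None)

    def history(self, passport_id: str, page: int = 1, page_size: int = 10) -> List[dict]:
        """Newest first, one page at a time (pages are 1-based)."""
        ordered = sorted(self._versions.get(passport_id, []),
                         key=lambda v: v["version_number"], reverse=True)
        start = (page - 1) * page_size
        return ordered[start:start + page_size]

store = VersionStore()
for status in ("draft", "published", "amended"):
    store.append("dpp-0001", {"status": status})

print(store.current("dpp-0001")["status"])
print([v["version_number"] for v in store.history("dpp-0001", page=1, page_size=2)])
```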

Performance Optimization

Document database performance is critical for DPP systems, especially for consumer-facing queries and high-volume data access. Performance optimization requires attention to indexing, data modeling, and query patterns.

Indexing Strategy: Indexing is the primary performance optimization technique. Indexes should be created on frequently queried fields (passport_id, product_id, product_type, manufacturer_id). Compound indexes support multi-field queries. Indexes improve read performance but add write overhead. Indexing strategy should be based on query patterns and should be monitored for effectiveness. For DPP systems, indexing on passport_id and product_id is essential for efficient lookup.
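As a sketch of how these indexes might be declared, the function below calls create_index on a pymongo-style collection. It assumes a MongoDB deployment and one document per passport (with versioned documents, the unique index would need to cover passport_id together with the version number). RecordingCollection is a stub included only so the example runs without a database.

```python
from typing import Any, List, Tuple

ASCENDING = 1  # same value as pymongo.ASCENDING

def ensure_passport_indexes(collection: Any) -> List[str]:
    """Create lookup indexes on a pymongo-style collection; return index names."""
    return [
        # Unique single-field index for direct passport lookup.
        collection.create_index("passport_id", unique=True),
        # Single-field index for lookups by product.
        collection.create_index("product_id"),
        # Compound index for queries filtering on type and manufacturer together.
        collection.create_index(
            [("product_type", ASCENDING), ("manufacturer_id", ASCENDING)]
        ),
    ]

class RecordingCollection:
    """Stub that records calls and mimics MongoDB's default index naming."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Any, dict]] = []

    def create_index(self, keys: Any, **kwargs: Any) -> str:
        self.calls.append((keys, kwargs))
        if isinstance(keys, str):
            return f"{keys}_1"
        return "_".join(f"{name}_{direction}" for name, direction in keys)

stub = RecordingCollection()
index_names = ensure_passport_indexes(stub)
print(index_names)
```

With a real deployment, the same function would be passed a collection obtained from the driver instead of the stub.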

Data Modeling for Performance: Data modeling affects performance. Modeling considerations include embedding vs referencing (embed related data for read performance, reference for write performance), document size (keep documents under size limits), and array design (avoid unbounded arrays). Embedding improves read performance for related data but increases document size and update complexity. For DPP systems, embedding is appropriate for frequently accessed related data (e.g., product attributes), referencing for large collections (e.g., evidence).
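The size effect of embedding versus referencing can be made concrete with a rough measurement. The sketch below approximates document size from its JSON encoding (the stored size in a given database will differ somewhat) and uses a block of zero bytes as a stand-in for a certificate file; all names are illustrative.

```python
import base64
import json

def approx_json_bytes(doc: dict) -> int:
    """Approximate document size as compact UTF-8 JSON."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))

fake_certificate = b"\x00" * 200_000  # stand-in for the bytes of a PDF

base = {
    "passport_id": "dpp-0001",
    "product": {"product_attributes": {"chemistry": "NMC"}},
}

# Embedding: the file content travels with every read of the passport.
embedded = dict(base, evidence={"certificates": [
    {"name": "cert-01",
     "content_b64": base64.b64encode(fake_certificate).decode("ascii")}
]})

# Referencing: only an object storage key is stored in the passport.
referenced = dict(base, evidence={"certificates": [
    {"name": "cert-01", "storage_key": "evidence/dpp-0001/cert-01.pdf"}
]})

print("embedded:  ", approx_json_bytes(embedded), "bytes")
print("referenced:", approx_json_bytes(referenced), "bytes")
```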

Query Optimization: Query optimization ensures efficient document retrieval. Optimization includes using appropriate query operators (use indexed fields in queries), limiting result sets (limit, pagination), and avoiding expensive operations (avoid large array scans, expensive regex). Query patterns should be reviewed and optimized based on execution plans. For DPP systems, query optimization is essential for consumer-facing performance.

Sharding Strategy: Sharding distributes data across multiple servers for horizontal scalability. Shard key selection is critical: the shard key should distribute data evenly and support query patterns. Common shard keys include product_id (even distribution), manufacturer_id (group by manufacturer), and product_type (group by category). Keys with few distinct values or skewed volumes, such as product_type or a manufacturer_id dominated by a few large manufacturers, can concentrate data and load on a small number of shards. The shard key should be chosen based on query patterns and data distribution. For DPP systems, sharding by product_id or manufacturer_id is common.
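To see why the choice of shard key matters, the sketch below simulates hash-based placement of 10,000 synthetic passports on four shards. It is a simplification (real databases also offer range-based sharding and manage chunks and balancing themselves), and shard_for is a hypothetical helper, not a database API.

```python
import hashlib
from collections import Counter

def shard_for(key: str, shard_count: int) -> int:
    """Map a shard key value to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

SHARDS = 4

# Synthetic data: unique product_id values, but only two product types.
docs = [
    {"product_id": f"prod-{i:05d}",
     "product_type": "textile" if i % 10 == 0 else "battery"}
    for i in range(10_000)
]

by_product_id = Counter(shard_for(d["product_id"], SHARDS) for d in docs)
by_product_type = Counter(shard_for(d["product_type"], SHARDS) for d in docs)

print("product_id  :", dict(sorted(by_product_id.items())))
print("product_type:", dict(sorted(by_product_type.items())))
```

A high-cardinality key such as product_id spreads documents across all shards, while a key with two distinct values can never use more than two shards, however many are added.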

Data Access Patterns

Data access patterns define how applications interact with document databases. Effective patterns ensure efficient data access while maintaining data integrity.

CRUD Operations: CRUD operations for document databases differ from relational databases. The operations are Create (insert new document), Read (find documents by query), Update (update document fields or replace entire document), and Delete (remove document). Each operation is atomic at the level of a single document. For DPP systems, CRUD operations should be implemented through data access layers that enforce business rules.

Atomic Updates: Document databases provide atomic updates at the document level. Updates can be field-level (update specific fields) or document-level (replace entire document). Atomic updates prevent partial updates and ensure consistency. For DPP systems, atomic updates are valuable for maintaining data integrity during concurrent updates.
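One common way to use document-level atomicity under concurrent writers is an optimistic, guarded update: the filter matches only if the document still carries the version the writer read. The sketch below builds the filter and update documents using MongoDB's $set, $inc, and $currentDate operators; the function name and the in-place version_number counter are assumptions for illustration.

```python
from typing import Tuple

RESERVED_FIELDS = {"version_number", "last_updated"}

def build_guarded_update(passport_id: str, expected_version: int,
                         changes: dict) -> Tuple[dict, dict]:
    """Build (filter, update) documents for a field-level, single-document update."""
    clash = RESERVED_FIELDS & set(changes)
    if clash:
        # The same field may not appear under two update operators.
        raise ValueError(f"fields managed by the update itself: {sorted(clash)}")
    # The filter only matches if nobody else has updated the document since.
    filter_doc = {"passport_id": passport_id, "version_number": expected_version}
    update_doc = {
        "$set": dict(changes),
        "$inc": {"version_number": 1},
        "$currentDate": {"last_updated": True},
    }
    return filter_doc, update_doc

filter_doc, update_doc = build_guarded_update(
    "dpp-0001", 3, {"product.product_attributes.chemistry": "LFP"}
)
print(filter_doc)
print(update_doc)

# With pymongo, these would be passed to collection.update_one(filter_doc, update_doc);
# a result with modified_count == 0 means another writer got there first.
```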

Aggregation Pipelines: Aggregation pipelines enable complex data processing and analysis. Pipelines include stages for filtering (match documents), grouping (group by fields), aggregation (sum, count, average), and projection (select fields). Aggregation pipelines are powerful for reporting and analytics. For DPP systems, aggregation pipelines are valuable for regulatory reporting and supply chain analytics.
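The stages named above can be sketched as a MongoDB-style pipeline that counts battery passports per manufacturer. The field names are assumptions about the passport schema, and the small in-memory function is included only to show what result the pipeline is meant to produce.

```python
from collections import Counter
from typing import List

# MongoDB-style pipeline: filter, group and count, reshape, then sort.
pipeline = [
    {"$match": {"product_type": "battery"}},
    {"$group": {"_id": "$manufacturer_id", "passport_count": {"$sum": 1}}},
    {"$project": {"_id": 0, "manufacturer_id": "$_id", "passport_count": 1}},
    {"$sort": {"passport_count": -1}},
]

def run_in_memory(docs: List[dict]) -> List[dict]:
    """Pure-Python equivalent of the pipeline, for illustration only."""
    counts = Counter(
        d["manufacturer_id"] for d in docs if d.get("product_type") == "battery"
    )
    return [
        {"manufacturer_id": manufacturer, "passport_count": count}
        for manufacturer, count in counts.most_common()
    ]

sample = [
    {"manufacturer_id": "m-1", "product_type": "battery"},
    {"manufacturer_id": "m-1", "product_type": "battery"},
    {"manufacturer_id": "m-2", "product_type": "battery"},
    {"manufacturer_id": "m-2", "product_type": "textile"},
]
report = run_in_memory(sample)
print(report)

# With pymongo, the pipeline would be passed to collection.aggregate(pipeline).
```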

Bulk Operations: Bulk operations improve performance for high-volume operations. Bulk insert (insert multiple documents in single operation), bulk update (update multiple documents in single operation), and bulk delete (delete multiple documents in single operation) reduce round-trips and improve throughput. Bulk operations should be used for supplier data submission and bulk updates. For DPP systems, bulk operations are essential for handling high-volume data loads.
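Bulk writes are usually sent in bounded batches rather than one enormous request. The helper below, a hypothetical name, splits any iterable of documents into fixed-size batches; the batch size of 1,000 is an arbitrary assumption to tune per workload.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `batch_size` documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Synthetic supplier submission of 2,500 passports.
submission = ({"passport_id": f"dpp-{i:05d}"} for i in range(2_500))
batch_sizes = [len(batch) for batch in batched(submission, 1_000)]
print(batch_sizes)

# With pymongo, each batch could be written in one round-trip, for example:
#     collection.insert_many(batch, ordered=False)
```

Because the helper consumes a generator lazily, the full submission never has to be held in memory at once.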

Document Database Features

Modern document databases offer advanced features that can enhance DPP system capabilities.

Change Streams: Change streams provide real-time notifications of database changes. Applications can subscribe to change streams to receive notifications when documents are inserted, updated, or deleted. Change streams enable event-driven architectures and real-time synchronization. For DPP systems, change streams are valuable for real-time supply chain visibility and cache invalidation.
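For cache invalidation, a consumer can apply each change event to a local cache. The sketch below assumes MongoDB-style change events (operationType, documentKey, fullDocument) and feeds the handler synthetic events so that it runs without a database; apply_change_events is a hypothetical helper.

```python
from typing import Dict, Iterable

def apply_change_events(cache: Dict[str, dict], events: Iterable[dict]) -> None:
    """Keep a cache keyed by document _id in step with change events."""
    for event in events:
        key = event.get("documentKey")
        if key is None:
            continue  # events such as collection drops carry no document key
        doc_id = str(key["_id"])
        operation = event.get("operationType")
        if operation in ("insert", "replace"):
            cache[doc_id] = event["fullDocument"]
        elif operation in ("update", "delete"):
            # Drop the stale entry; it is reloaded on the next read.
            cache.pop(doc_id, None)

cache: Dict[str, dict] = {"a1": {"passport_id": "dpp-0001", "status": "draft"}}
synthetic_events = [
    {"operationType": "update", "documentKey": {"_id": "a1"}},
    {"operationType": "insert", "documentKey": {"_id": "b2"},
     "fullDocument": {"_id": "b2", "passport_id": "dpp-0002"}},
    {"operationType": "delete", "documentKey": {"_id": "zz"}},
    {"operationType": "drop"},
]
apply_change_events(cache, synthetic_events)
print(sorted(cache))

# With pymongo, the events would come from a change stream, for example:
#     with collection.watch() as stream:
#         apply_change_events(cache, stream)
```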

Full-Text Search: Many document databases include full-text search capabilities. Full-text search enables text search within document fields, enabling search without separate search engine. Capabilities include text indexes, phrase search, and relevance ranking. For DPP systems, full-text search can supplement dedicated search engines for simpler use cases.

Time Series Collections: Some document databases have optimized collections for time-series data. Time series collections provide efficient storage and query of timestamped data. Time series collections are appropriate for sensor data, performance metrics, and audit logs. For DPP systems, time series collections may be used for operational monitoring and IoT data.

GridFS: GridFS is MongoDB's specification for storing large files in the database. GridFS splits large files into chunks and stores each chunk as a separate document, which allows files larger than the single-document size limit. GridFS enables storing large files (evidence documents, media) alongside structured data. For DPP systems, GridFS can be an alternative to object storage for some use cases, though object storage is typically preferred for large files.
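The chunking idea can be illustrated without a database. The sketch below splits a byte payload into chunk documents with files_id, n, and data fields, using 255 KiB as the chunk size (the GridFS default). It is a conceptual illustration only; in practice the driver's GridFS support performs this splitting.

```python
from typing import List

CHUNK_SIZE_BYTES = 255 * 1024  # default GridFS chunk size

def split_into_chunks(files_id: str, payload: bytes,
                      chunk_size: int = CHUNK_SIZE_BYTES) -> List[dict]:
    """Split a payload into ordered chunk documents."""
    return [
        {"files_id": files_id, "n": n, "data": payload[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(payload), chunk_size))
    ]

payload = bytes(600 * 1024)  # stand-in for a 600 KiB evidence file
chunks = split_into_chunks("file-001", payload)

print([(c["n"], len(c["data"])) for c in chunks])

# Reassembly is concatenation in chunk order.
restored = b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))
```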

Technical Concepts

  • Document Database: Database storing semi-structured documents
  • JSON: JavaScript Object Notation, common document format
  • Flexible Schema: Schema that allows different documents to have different fields
  • Sharding: Horizontal partitioning of data across multiple servers
  • Aggregation Pipeline: Framework for complex data processing
  • Change Stream: Real-time notification of database changes
  • Document Versioning: Tracking multiple versions of documents
  • Atomic Update: Update that completes entirely or not at all
  • Embedding: Storing related data within a document
  • Referencing: Storing references to other documents
  • Index: Data structure improving query performance
  • GridFS: MongoDB specification for storing large files as chunked documents

Architecture Considerations

Database Architecture: Design document database architecture based on requirements. Consider single collection (all passports in one collection) vs multiple collections (separate collections by product type). A single collection simplifies queries but may result in a large, heterogeneous collection whose documents vary widely in shape. Multiple collections provide separation but require cross-collection queries. For DPP systems, a single collection with a product_type field is common for simplicity, while multiple collections suit distinct product types with very different schemas.

Data Modeling Strategy: Design data modeling strategy for DPP documents. Consider embedding (store related data in document) vs referencing (store references to other documents). Embedding improves read performance but increases document size. Referencing provides flexibility but requires additional queries. For DPP systems, embed frequently accessed related data and reference large collections and optional data.

Replication Architecture: Design replication for high availability. Replication includes replica sets (primary with multiple secondaries) and read preferences (direct reads to appropriate replicas). Replica sets provide automatic failover. Read preferences enable distributing read load. For DPP systems, replica sets with appropriate read preferences are essential for high availability and read scalability.

Sharding Architecture: Design sharding for horizontal scalability. Architecture includes shard key selection (how data is distributed), shard balancing (even distribution across shards), and shard management (adding/removing shards). Shard key selection is critical and should be based on query patterns. For DPP systems, sharding by product_id or manufacturer_id is common.

Security Architecture: Design security for document database access. Security includes authentication (database credentials, certificates), authorization (role-based access control), and encryption (encryption at rest and in transit). Security should be defense-in-depth. For DPP systems, security is critical for protecting sensitive passport data.

Implementation Considerations

Database Selection: Select appropriate document database. Options include MongoDB (popular, feature-rich), Couchbase (high performance, enterprise features), Amazon DocumentDB (managed service with MongoDB API compatibility, not MongoDB itself, so feature support should be verified), and Azure Cosmos DB (multi-model). Selection should be based on requirements, team expertise, and cloud provider preferences. For DPP systems, MongoDB is commonly used for its ecosystem and features.

Driver Selection: Select appropriate database driver for application language. Drivers should support connection pooling, retry logic, and query optimization. Driver selection should be based on language ecosystem and features. For DPP systems, use official drivers from database vendor.

Connection Configuration: Configure database connections appropriately. Configuration includes connection string (server addresses, credentials), connection pool (pool size, timeout), and retry policy (retry logic for transient failures). Configuration should be tuned based on workload characteristics. For DPP systems, connection configuration is essential for high-concurrency scenarios.
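As one illustration, a MongoDB-style connection string can carry the replica set, read preference, pool, and retry settings as URI options. The host names, database name, and option values below are placeholders, and credentials are deliberately left out on the assumption that they come from a secret store rather than source code.

```python
from typing import Dict, List
from urllib.parse import urlencode

def build_mongodb_uri(hosts: List[str], database: str, options: Dict[str, object]) -> str:
    """Assemble a mongodb:// URI from hosts, a database name, and URI options."""
    return f"mongodb://{','.join(hosts)}/{database}?{urlencode(options)}"

uri = build_mongodb_uri(
    hosts=["db1.example.internal:27017", "db2.example.internal:27017"],
    database="dpp",
    options={
        "replicaSet": "rs0",                     # connect to the replica set
        "readPreference": "secondaryPreferred",  # spread reads across secondaries
        "maxPoolSize": 100,                      # upper bound on pooled connections
        "retryWrites": "true",                   # retry transient write failures once
        "serverSelectionTimeoutMS": 5000,        # fail fast if no server is reachable
    },
)
print(uri)
```

Reads routed to secondaries may be slightly stale, so this read preference suits consumer lookups better than read-after-write workflows.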

Index Implementation: Implement indexes based on query patterns. Indexes should be created on frequently queried fields (passport_id, product_id, product_type, manufacturer_id). Compound indexes should support multi-field queries. Indexes should be monitored for usage and effectiveness. For DPP systems, indexing is essential for query performance.

Validation Implementation: Implement validation at the application level, and consider enforcing the same schema at the database level where the database supports it. Validation should include schema validation (validate against JSON Schema), business rule validation (validate domain rules), and reference validation (validate references to valid entities). Validation should provide clear error messages. For DPP systems, validation is essential for data quality despite flexible schemas.

Enterprise Examples

Battery Document Database: A European automotive manufacturer implemented MongoDB for EV battery passport storage. Document structure followed CEDM with embedded product attributes and referenced evidence documents. Versioned documents tracked all changes with version metadata. Sharding by product_id distributed data across multiple shards. Change streams enabled real-time synchronization with search index. The implementation supported 15+ year retention with flexible schema evolution for new battery types.

Textile Document Database: A European textile industry association implemented Couchbase for a textile passport platform. Document structure included multi-tenancy through an organization_id field. Flexible schema accommodated diverse textile product types. Full-text search enabled product search without a separate search engine. Aggregation pipelines generated sustainability reports. The implementation supported industry-wide passport storage with high performance and schema flexibility.

Electronics Document Database: A consumer electronics manufacturer implemented Amazon DocumentDB for electronic product passport storage. Document structure used embedding for frequently accessed data (product attributes) and referencing for large collections (evidence). Time series collections stored operational metrics. GridFS stored some evidence documents directly in the database. The implementation supported global product portfolios with managed database service reducing operational burden.

Common Mistakes

Large Documents: Creating documents that are too large, resulting in poor performance. Document databases enforce document size limits that vary by product (for example, 16 MB per document in MongoDB). Large documents should be avoided by referencing large data (evidence documents) rather than embedding. Large documents also increase network transfer and memory usage.

No Indexing: Not creating indexes on frequently queried fields, resulting in poor query performance. Indexes should be created on fields used in queries, especially foreign key equivalents and filter fields. Indexing strategy should be based on query patterns.

Over-Embedding: Embedding too much data in documents, resulting in large documents and poor update performance. Embedding should be used for frequently accessed, small related data. Large collections and optional data should be referenced.

Ignoring Versioning: Not implementing version management, resulting in inability to track document evolution. Versioning is essential for audit trails and historical analysis. Versioning should be implemented from the start.

Poor Shard Key Selection: Choosing a shard key that doesn't distribute data evenly or doesn't support query patterns, resulting in uneven load and poor performance. The shard key should distribute data evenly and should align with query patterns.

Best Practices

Appropriate Document Size: Keep documents well under the database's size limit (for example, 16 MB in MongoDB). Large data should be referenced (object storage) rather than embedded. Document size affects performance and should be monitored.

Strategic Embedding: Embed frequently accessed, small related data. Reference large collections and optional data. Embedding improves read performance but increases document size. Embedding strategy should be based on access patterns.

Comprehensive Indexing: Create indexes on frequently queried fields. Use compound indexes for multi-field queries. Monitor index usage and remove unused indexes. Indexing is the primary performance optimization technique.

Version Management: Implement version management from the start. Versioning should include version metadata and should enable retrieval of current and historical versions. Versioning is essential for audit trails and regulatory compliance.

Change Streams: Use change streams for real-time notifications. Change streams enable event-driven architectures and real-time synchronization. Change streams are valuable for cache invalidation and real-time updates.

Schema Validation: Implement schema validation despite flexible schemas. Validation ensures data quality and enables controlled schema evolution. Validation should be automated and should provide clear error messages.

Key Takeaways

  • Document databases are ideal for DPP data that naturally fits document structure
  • Passport documents include root, product, organization, evidence, and metadata sections
  • Flexible schemas enable evolution without downtime and accommodate diverse product types
  • Version management tracks document evolution and enables historical retrieval
  • Performance optimization requires indexing, data modeling, and query optimization
  • Data access patterns include CRUD operations, atomic updates, aggregation pipelines, and bulk operations
  • Advanced features include change streams, full-text search, time series collections, and GridFS
  • Architecture considerations include database architecture, data modeling strategy, replication, sharding, and security
  • Implementation considerations include database selection, driver selection, connection configuration, indexing, and validation
  • Common mistakes include large documents, no indexing, over-embedding, ignoring versioning, and poor shard key selection
  • Best practices include appropriate document size, strategic embedding, comprehensive indexing, version management, change streams, and schema validation