RFC: Modular Architecture Refactor for MHTMLExtractor (v2)
Summary
This refactor will reorganize the MHTMLExtractor codebase into a more modular, extensible architecture. The goal is to isolate major features and responsibilities into separate, maintainable classes, following the dependency inversion and composition-over-inheritance principles. Contributors and users will then be able to extend functionality or replace individual components without modifying core logic.
Goals
- Split monolithic logic into distinct modules and classes (e.g., parsing, decoding, resource resolution, output handling).
- Define clear interfaces or Protocol/ABC contracts for core behaviors.
- Reduce tight coupling between components and remove direct imports where composition would suffice.
- Simplify unit testing by allowing mock implementations for each subsystem.
- Provide a central factory or builder to construct configured instances with optional overrides.
- Preserve current functionality and output behavior by default.
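The testing goal above can be sketched as follows. This is a minimal illustration, not the project's actual code: `LinkRewriter`, the `resolve` method name, and its signature are hypothetical names chosen for the example. Only `ResourceResolver` comes from this RFC, and its real contract is still to be defined.

```python
from typing import Protocol
from unittest.mock import Mock


class ResourceResolver(Protocol):
    """Assumed contract: maps an original reference to a local path."""

    def resolve(self, location: str) -> str: ...


class LinkRewriter:
    """Hypothetical component that depends only on the resolver contract."""

    def __init__(self, resolver: ResourceResolver) -> None:
        self._resolver = resolver

    def rewrite(self, html: str, locations: list[str]) -> str:
        # Replace each original reference with whatever the resolver returns.
        for location in locations:
            html = html.replace(location, self._resolver.resolve(location))
        return html


def test_rewrite_delegates_to_resolver() -> None:
    # The mock stands in for any real resolver; no files or network needed.
    resolver = Mock(spec=ResourceResolver)
    resolver.resolve.return_value = "assets/logo.png"

    rewriter = LinkRewriter(resolver)
    result = rewriter.rewrite('<img src="cid:logo">', ["cid:logo"])

    assert result == '<img src="assets/logo.png">'
    resolver.resolve.assert_called_once_with("cid:logo")
```

Because `LinkRewriter` receives its resolver through the constructor, the test can exercise the rewriting logic without touching any real resolution code.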
Non-Goals
- Introducing a full dependency injection container or framework.
- Adding new end-user features unrelated to modularization.
- Redesigning CLI or API interfaces beyond what’s necessary for cleaner structure.
Motivation
The current implementation makes customization and maintenance difficult because several responsibilities are tightly coupled. By decoupling these concerns, developers can more easily:
- Extend extractors for new MIME structures or file types.
- Swap decoders or storage backends without rewriting existing code.
- Test components in isolation.
- Keep the codebase easy to reason about as features grow.
Proposed Architecture
Introduce core abstract interfaces or Protocols:
- ContentDecoder
- ResourceResolver
- AttachmentStore
- MhtmlExtractor
Provide default implementations for each interface.
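As a rough sketch of what two of these contracts and their defaults might look like, assuming `typing.Protocol` is chosen over ABCs: the method names (`decode`, `save`), their signatures, and the classes `StandardContentDecoder` and `InMemoryAttachmentStore` are illustrative assumptions, not part of the existing codebase.

```python
import base64
import quopri
from typing import Protocol, runtime_checkable


@runtime_checkable
class ContentDecoder(Protocol):
    """Assumed contract: turns an encoded MIME part body into raw bytes."""

    def decode(self, payload: bytes, encoding: str) -> bytes: ...


@runtime_checkable
class AttachmentStore(Protocol):
    """Assumed contract: persists a resource and returns its stored name."""

    def save(self, name: str, data: bytes) -> str: ...


class StandardContentDecoder:
    """Hypothetical default covering base64 and quoted-printable."""

    def decode(self, payload: bytes, encoding: str) -> bytes:
        encoding = encoding.lower()
        if encoding == "base64":
            return base64.b64decode(payload)
        if encoding == "quoted-printable":
            return quopri.decodestring(payload)
        # 7bit, 8bit, and binary bodies pass through unchanged.
        return payload


class InMemoryAttachmentStore:
    """Hypothetical store that keeps resources in a dict (handy for tests)."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}

    def save(self, name: str, data: bytes) -> str:
        self.files[name] = data
        return name
```

Neither default inherits from its Protocol. Structural typing means any class with matching methods satisfies the contract, which keeps third-party implementations free of imports from this package.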
Use a top-level factory to assemble the extractor from default or user-supplied components.
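One possible shape for such a factory is sketched below. Everything here is an assumption made for illustration: the function name `create_extractor`, the `Extractor` dataclass, the `extract_part` method, and the stand-in defaults `PassthroughDecoder` and `DictStore`.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class ContentDecoder(Protocol):
    def decode(self, payload: bytes, encoding: str) -> bytes: ...


class AttachmentStore(Protocol):
    def save(self, name: str, data: bytes) -> str: ...


class PassthroughDecoder:
    """Stand-in default that returns the payload unchanged."""

    def decode(self, payload: bytes, encoding: str) -> bytes:
        return payload


class DictStore:
    """Stand-in default that keeps resources in memory."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}

    def save(self, name: str, data: bytes) -> str:
        self.files[name] = data
        return name


@dataclass
class Extractor:
    """Composes its collaborators instead of inheriting from them."""

    decoder: ContentDecoder
    store: AttachmentStore

    def extract_part(self, name: str, payload: bytes, encoding: str) -> str:
        return self.store.save(name, self.decoder.decode(payload, encoding))


def create_extractor(
    decoder: Optional[ContentDecoder] = None,
    store: Optional[AttachmentStore] = None,
) -> Extractor:
    # Compare against None explicitly so a falsy custom component
    # (for example, one defining __len__) is never silently replaced.
    return Extractor(
        decoder=decoder if decoder is not None else PassthroughDecoder(),
        store=store if store is not None else DictStore(),
    )
```

Calling `create_extractor()` with no arguments yields the default wiring, while `create_extractor(store=my_store)` overrides a single component and leaves the rest untouched.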
Organize modules under a consistent structure:
mhtmlextractor/
    core/
    decoders/
    resolvers/
    storage/
    factory.py
    interfaces.py
Migration and Versioning
This change will alter the internal API surface and may break code that imports internal modules directly.
Version bump: v2.0.0 (breaking changes)
Implementation Plan
- Create branch refactor/modular-architecture.
- Define and document new interfaces (Protocol / ABC).
- Extract existing logic into new module classes.
- Introduce the factory/builder pattern for assembly.
- Update tests and add mocks for interfaces.
- Update documentation and examples.
- Release alpha versions (2.0.0a1, a2, etc.) for feedback.