Merged
102 changes: 100 additions & 2 deletions CLAUDE.md
@@ -99,10 +99,20 @@ final spec = InferenceModelSpec(

**Runtime accepts configuration each time:**
- `maxTokens` - Context size (default: 1024)
- `preferredBackend` - Hardware backend (see PreferredBackend below)
- `supportImage` - Multimodal support
- `maxNumImages` - Image limits

**PreferredBackend enum:**
| Value | Android | iOS | Web | Desktop |
|-------|---------|-----|-----|---------|
| `cpu` | ✅ | ✅ | ❌ | ✅ |
| `gpu` | ✅ | ✅ | ✅ (required) | ✅ |
| `npu` | ✅ (.litertlm) | ❌ | ❌ | ❌ |

> - **NPU**: Qualcomm, MediaTek, Google Tensor. Up to 25x faster than CPU.
> - **Web**: GPU only (MediaPipe limitation).
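
The support matrix above can be encoded as a simple table-driven check. This is a hypothetical Java sketch for illustration only (the plugin's actual API is Dart/Kotlin; the `BackendSupport` class and its names are not part of the plugin):

```java
import java.util.EnumSet;
import java.util.Map;

// Hypothetical encoding of the PreferredBackend support matrix above.
class BackendSupport {
    enum Backend { CPU, GPU, NPU }
    enum Platform { ANDROID, IOS, WEB, DESKTOP }

    // Mirrors the table: Web is GPU-only, NPU is Android-only (.litertlm models).
    static final Map<Platform, EnumSet<Backend>> SUPPORT = Map.of(
        Platform.ANDROID, EnumSet.of(Backend.CPU, Backend.GPU, Backend.NPU),
        Platform.IOS,     EnumSet.of(Backend.CPU, Backend.GPU),
        Platform.WEB,     EnumSet.of(Backend.GPU),
        Platform.DESKTOP, EnumSet.of(Backend.CPU, Backend.GPU)
    );

    static boolean isSupported(Platform platform, Backend backend) {
        return SUPPORT.get(platform).contains(backend);
    }
}
```

A lookup like `isSupported(Platform.WEB, Backend.CPU)` returns `false`, matching the note that CPU models fail to initialize on web.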

**Usage:**
```dart
// Step 1: Install with identity
@@ -515,6 +525,74 @@ use_frameworks! :linkage => :static
<uses-native-library android:name="libOpenCL-pixel.so" android:required="false"/>
```

#### Android LiteRT-LM Engine (v0.12.x+)

Android now supports **dual inference engines** - MediaPipe and LiteRT-LM - with automatic selection based on file extension.

**Engine Selection:**
| File Extension | Engine | Android | Desktop | Web |
|----------------|--------|---------|---------|-----|
| `.task`, `.bin`, `.tflite` | MediaPipe | Yes | No | Yes |
| `.litertlm` | LiteRT-LM | Yes | Yes | No |
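
The extension-based dispatch in the table can be sketched as follows. This is an illustrative Java sketch, not the plugin's actual `EngineFactory.kt` logic, which may differ in detail:

```java
// Illustrative sketch of extension-based engine selection.
class EngineSelector {
    enum Engine { MEDIAPIPE, LITERT_LM }

    static Engine forModelPath(String path) {
        String lower = path.toLowerCase();
        if (lower.endsWith(".litertlm")) {
            return Engine.LITERT_LM;   // LiteRT-LM engine
        }
        if (lower.endsWith(".task") || lower.endsWith(".bin") || lower.endsWith(".tflite")) {
            return Engine.MEDIAPIPE;   // MediaPipe engine
        }
        throw new IllegalArgumentException("Unsupported model file: " + path);
    }
}
```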

**Architecture:**
```
android/src/main/kotlin/dev/flutterberlin/flutter_gemma/
├── FlutterGemmaPlugin.kt # Plugin entry point
├── PlatformService.g.kt # Pigeon-generated interface
└── engines/ # Engine abstraction layer
├── InferenceEngine.kt # Strategy interface
├── InferenceSession.kt # Session interface
├── EngineConfig.kt # Configuration + SessionConfig + FlowFactory
├── EngineFactory.kt # Factory for engine creation
├── mediapipe/
│ ├── MediaPipeEngine.kt # MediaPipe adapter (wraps LlmInference)
│ └── MediaPipeSession.kt # MediaPipe session adapter
└── litertlm/
├── LiteRtLmEngine.kt # LiteRT-LM implementation
└── LiteRtLmSession.kt # LiteRT-LM session with chunk buffering
```

**Key Design Decisions:**

1. **Strategy Pattern**: `InferenceEngine` interface allows interchangeable engine implementations
2. **Adapter Pattern**: `MediaPipeEngine` wraps existing MediaPipe code without modifications
3. **Chunk Buffering**: LiteRT-LM uses `sendMessage()` not `addQueryChunk()`, so `LiteRtLmSession` buffers chunks in `StringBuilder` and sends complete message on `generateResponse()`
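
The Strategy and Adapter patterns above can be sketched together. This is a Java illustration under stated assumptions: the interface names mirror the Kotlin sources, but `LegacyLlmSession` is a hypothetical stand-in for MediaPipe's API, not its real signatures:

```java
// Strategy: one interface, interchangeable engine implementations.
interface InferenceSession {
    void addQueryChunk(String chunk);
    String generateResponse();
}

interface InferenceEngine {
    InferenceSession newSession();
}

// Hypothetical stand-in for MediaPipe's existing chunk-based API.
class LegacyLlmSession {
    private final StringBuilder ctx = new StringBuilder();
    void addChunk(String c) { ctx.append(c); }
    String respond() { return "echo: " + ctx; }
}

// Adapter: forwards calls to the legacy API without modifying it.
class MediaPipeEngine implements InferenceEngine {
    @Override public InferenceSession newSession() {
        LegacyLlmSession legacy = new LegacyLlmSession();
        return new InferenceSession() {
            @Override public void addQueryChunk(String chunk) { legacy.addChunk(chunk); }
            @Override public String generateResponse() { return legacy.respond(); }
        };
    }
}
```

Because callers only see `InferenceEngine` and `InferenceSession`, a LiteRT-LM implementation can be swapped in without touching any call sites.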

**LiteRT-LM Limitations:**

⚠️ **Token Counting**: The LiteRT-LM SDK does not expose a tokenizer API. The implementation uses an estimate of ~4 characters per token and emits a warning log:
```kotlin
Log.w(TAG, "sizeInTokens: LiteRT-LM does not support token counting. " +
"Using estimate (~4 chars/token): $estimate tokens for ${prompt.length} chars. " +
"This may be inaccurate for non-English text.")
```
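
The heuristic itself is a one-liner. A minimal Java sketch, assuming the estimate rounds up so short non-empty prompts never report zero tokens (the source does not show the exact rounding):

```java
// Sketch of the ~4 chars/token heuristic used when no tokenizer API exists.
class TokenEstimator {
    static final int CHARS_PER_TOKEN = 4;   // rough average for English text

    // Ceiling division: a 1-char prompt still counts as 1 token.
    static int estimateTokens(String prompt) {
        return (prompt.length() + CHARS_PER_TOKEN - 1) / CHARS_PER_TOKEN;
    }
}
```

As the warning notes, this undercounts or overcounts for non-English text, where real tokenizers average far from 4 characters per token.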

⚠️ **Cancellation**: `cancelGeneration()` is not yet supported by the LiteRT-LM SDK (0.9.x).

**LiteRT-LM Behavioral Differences:**

1. **Chunk Buffering**: Unlike MediaPipe which processes `addQueryChunk()` directly, LiteRT-LM buffers chunks in `StringBuilder` and sends complete message on `generateResponse()`.
2. **Thread-Safe Accumulation**: Uses `synchronized(promptLock)` for safe concurrent chunk additions.
3. **Cache Support**: Engine configured with `cacheDir` for faster reloads (~10s cold → ~1-2s cached).
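
Points 1 and 2 can be sketched together: chunks accumulate under a lock and are flushed as one complete message. This is an illustrative Java sketch of the `StringBuilder` + `synchronized(promptLock)` approach described above; the class and method names are hypothetical, not the actual `LiteRtLmSession.kt`:

```java
// Sketch of chunk buffering with thread-safe accumulation.
class BufferingSession {
    private final Object promptLock = new Object();
    private final StringBuilder pending = new StringBuilder();

    // Buffered instead of being processed directly (unlike MediaPipe).
    void addQueryChunk(String chunk) {
        synchronized (promptLock) {
            pending.append(chunk);
        }
    }

    // Drains the buffer, returning the complete message that would be
    // handed to the engine's single sendMessage() call.
    String drainMessage() {
        synchronized (promptLock) {
            String message = pending.toString();
            pending.setLength(0);
            return message;
        }
    }
}
```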

**Dependency (build.gradle):**
```gradle
implementation 'com.google.ai.edge.litertlm:litertlm-android:0.9.0-alpha01'
```

**Usage (Dart - no changes required):**
```dart
// Engine is automatically selected based on file extension
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
.fromNetwork('https://example.com/model.litertlm') // → LiteRtLmEngine
.install();

await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
.fromNetwork('https://example.com/model.task') // → MediaPipeEngine
.install();
```

#### Web Configuration
```html
<!-- index.html -->
@@ -1080,7 +1158,27 @@ flutter_gemma/
└── CLAUDE.md # This file
```

## Recent Updates (2026-01-18)

### ✅ Android LiteRT-LM Engine (v0.12.x+)
- **Dual Engine Support** - MediaPipe and LiteRT-LM on Android
- **Automatic Selection** - Engine chosen by file extension (`.litertlm` → LiteRT-LM, `.task/.bin` → MediaPipe)
- **Strategy Pattern** - `InferenceEngine` interface with interchangeable implementations
- **Adapter Pattern** - `MediaPipeEngine` wraps existing code without modifications
- **Chunk Buffering** - LiteRT-LM session buffers `addQueryChunk()` calls for `sendMessage()` API
- **Token Estimation** - ~4 chars/token with warning log (LiteRT-LM lacks tokenizer API)
- **Zero Flutter API Changes** - Transparent to Dart layer

**Key Files:**
- `android/.../engines/InferenceEngine.kt` - Strategy interface
- `android/.../engines/EngineFactory.kt` - Factory for engine creation
- `android/.../engines/mediapipe/` - MediaPipe adapter
- `android/.../engines/litertlm/` - LiteRT-LM implementation

**Dependency:**
```gradle
implementation 'com.google.ai.edge.litertlm:litertlm-android:0.9.0-alpha01'
```

### ✅ Desktop Platform Support (v0.12.0+)
- **macOS, Windows, Linux** support via LiteRT-LM JVM
15 changes: 13 additions & 2 deletions README.md
@@ -1086,6 +1086,18 @@ final inferenceModel = await FlutterGemmaPlugin.instance.createModel(
);
```

**PreferredBackend Options:**

| Backend | Android | iOS | Web | Desktop |
|---------|---------|-----|-----|---------|
| `cpu` | ✅ | ✅ | ❌ | ✅ |
| `gpu` | ✅ | ✅ | ✅ (required) | ✅ |
| `npu` | ✅ (.litertlm) | ❌ | ❌ | ❌ |

- **NPU**: Qualcomm AI Engine, MediaTek NeuroPilot, Google Tensor. Up to 25x faster than CPU.
- **Web**: GPU only (MediaPipe limitation). CPU models will fail to initialize.
- **Desktop**: GPU uses Metal (macOS), DirectX 12 (Windows), Vulkan (Linux).

6. **Using Sessions for Single Inferences:**

If you need to generate individual responses without maintaining a conversation history, use sessions. Sessions allow precise control over inference and must be properly closed to avoid memory leaks.
@@ -2007,8 +2019,7 @@ final supported = await FlutterGemma.isStreamingSupported();
```

#### Backend Support
- **GPU only:** See [PreferredBackend Options](#preferredbackend-options) table above

#### CORS Configuration
- **Required for custom servers:** Enable CORS headers on your model hosting server
3 changes: 3 additions & 0 deletions android/build.gradle
@@ -73,6 +73,9 @@ dependencies {
implementation 'com.google.guava:guava:33.3.1-android'
implementation 'org.jetbrains.kotlinx:kotlinx-coroutines-guava:1.9.0'

// LiteRT-LM Engine for .litertlm model files
implementation 'com.google.ai.edge.litertlm:litertlm-android:0.9.0-alpha01'

implementation 'androidx.core:core-ktx:1.12.0'
implementation 'androidx.lifecycle:lifecycle-runtime-ktx:2.7.0'
testImplementation 'org.jetbrains.kotlin:kotlin-test'