native-llm - v0.2.0

    Class LLMEngine

    Native LLM Engine

    Provides text generation using llama.cpp with Metal GPU acceleration.

    Index

    Constructors

    Methods

    • Check whether the current platform is supported

      Returns boolean
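
      No example accompanies the platform check above, and its exact method name is not shown on this page. A minimal guard, with `isSupported()` as a hypothetical name for the documented check:

```typescript
// Pure helper: turn the boolean platform check into a user-facing message.
function platformMessage(supported: boolean): string {
  return supported
    ? "native engine available (Metal GPU on Apple Silicon)"
    : "platform not supported by native-llm"
}

// Hypothetical usage; replace `isSupported` with the real static method name:
// console.log(platformMessage(LLMEngine.isSupported()))
```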

    • initialize(): Initialize the engine and load the model

      Downloads the model from HuggingFace if not cached locally. Uses Metal GPU acceleration on Apple Silicon.

      Returns Promise<void>

      Throws Error if model download or loading fails

      const engine = new LLMEngine({ model: "gemma-3n-e4b" })
      await engine.initialize()
    • generate(options): Generate text from a prompt

      Automatically initializes the engine if not already done. For thinking-mode models (Qwen3, DeepSeek), applies appropriate settings.

      Parameters

      • options: GenerateOptions

        Generation options including prompt, maxTokens, temperature

      Returns Promise<GenerateResult>

      Generation result with text, token counts, and performance metrics

      const result = await engine.generate({
        prompt: "Explain quantum computing",
        maxTokens: 200,
        temperature: 0.7
      })
      console.log(result.text)
      console.log(`${result.tokensPerSecond.toFixed(1)} tok/s`)
    • generateStreaming(options, onToken): Generate text with streaming token-by-token output

      Same as generate() but calls onToken for each generated token, enabling real-time display of responses.

      Parameters

      • options: GenerateOptions

        Generation options including prompt, maxTokens, temperature

      • onToken: TokenCallback

        Callback invoked for each generated token

      Returns Promise<GenerateResult>

      Generation result with text, token counts, and performance metrics

      const result = await engine.generateStreaming(
        { prompt: "Write a haiku" },
        (token) => process.stdout.write(token)
      )
    • chat(messages, options?): Generate text using chat message format

      Supports multi-turn conversations with system, user, and assistant messages. Automatically manages chat history within the session.

      Parameters

      • messages: { role: "system" | "user" | "assistant"; content: string }[]

        Array of chat messages with role and content

      • Optional options: Omit<GenerateOptions, "prompt" | "systemPrompt">

        Optional generation options (maxTokens, temperature, etc.)

      Returns Promise<GenerateResult>

      Generation result with assistant's response

      const result = await engine.chat([
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "What is 2+2?" }
      ])
      console.log(result.text) // "4"
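
      Because the session manages history automatically, a follow-up turn only needs to send the new user message. A sketch of that pattern, written against a minimal structural view of the documented `chat()` signature (the `ChatEngine` interface below is an illustration, not part of the library):

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string }

// Minimal structural view of the chat() method documented above.
interface ChatEngine {
  chat(messages: ChatMessage[]): Promise<{ text: string }>
}

// Since the session keeps history, the second call passes only the new turn.
async function followUp(engine: ChatEngine): Promise<string> {
  await engine.chat([
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is 2+2?" },
  ])
  const reply = await engine.chat([{ role: "user", content: "And doubled?" }])
  return reply.text
}
```

      Pass a real LLMEngine instance as `engine`; it satisfies the interface structurally.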
    • Get information about the current model

      Returns model metadata including name, parameters, context length, supported languages, and benchmark scores.

      Returns
          | {
              name: "Gemma 3n E2B";
              repo: "unsloth/gemma-3n-E2B-it-GGUF";
              file: "gemma-3n-E2B-it-Q4_K_M.gguf";
              parameters: "5B→2B";
              quantization: "Q4_K_M";
              contextLength: 32768;
              languages: readonly [
                  "en",
                  "de",
                  "fr",
                  "es",
                  "it",
                  "pt",
                  "nl",
                  "pl",
                  "ru",
                  "ja",
                  "ko",
                  "zh",
              ];
              description: "Ultra-efficient edge model, ~2GB RAM";
              requiresAuth: false;
              benchmarks: { mmlu: 64; arena: 1250 };
          }
          | {
              name: "Gemma 3n E4B";
              repo: "unsloth/gemma-3n-E4B-it-GGUF";
              file: "gemma-3n-E4B-it-Q4_K_M.gguf";
              parameters: "8B→4B";
              quantization: "Q4_K_M";
              contextLength: 32768;
              languages: readonly [
                  "en",
                  "de",
                  "fr",
                  "es",
                  "it",
                  "pt",
                  "nl",
                  "pl",
                  "ru",
                  "ja",
                  "ko",
                  "zh",
              ];
              description: "Best edge model, ~3GB RAM";
              requiresAuth: false;
              benchmarks: { mmlu: 75; arena: 1300 };
          }
          | {
              name: "Gemma 3 27B";
              repo: "unsloth/gemma-3-27b-it-GGUF";
              file: "gemma-3-27b-it-Q4_K_M.gguf";
              parameters: "27B";
              quantization: "Q4_K_M";
              contextLength: 131072;
              languages: readonly [
                  "en",
                  "de",
                  "fr",
                  "es",
                  "it",
                  "pt",
                  "nl",
                  "pl",
                  "ru",
                  "ja",
                  "ko",
                  "zh",
              ];
              description: "Maximum quality, 128K context, ~18GB RAM";
              benchmarks: { mmlu: 77; arena: 1338 };
          }
          | {
              name: "GPT-OSS 20B";
              repo: "unsloth/gpt-oss-20b-GGUF";
              file: "gpt-oss-20b-Q4_K_M.gguf";
              parameters: "21B (3.6B active)";
              quantization: "Q4_K_M";
              contextLength: 131072;
              languages: readonly ["en"];
              description: "OpenAI's open model, MoE, ~16GB RAM";
              benchmarks: { mmlu: 82; arena: 1340 };
          }
          | {
              name: "Phi-4 14B";
              repo: "bartowski/phi-4-GGUF";
              file: "phi-4-Q4_K_M.gguf";
              parameters: "14B";
              quantization: "Q4_K_M";
              contextLength: 16384;
              languages: readonly ["en"];
              description: "Microsoft's reasoning-focused, excellent for STEM";
              benchmarks: { mmlu: 84; arena: 1320 };
          }
          | {
              name: "Qwen3 4B";
              repo: "unsloth/Qwen3-4B-GGUF";
              file: "Qwen3-4B-Q4_K_M.gguf";
              parameters: "4B";
              quantization: "Q4_K_M";
              contextLength: 32768;
              languages: readonly [
                  "en",
                  "zh",
                  "de",
                  "fr",
                  "es",
                  "pt",
                  "it",
                  "nl",
                  "pl",
                  "ru",
                  "ja",
                  "ko",
              ];
              description: "Thinking mode, 100+ languages, ~3GB RAM";
              thinkingMode: "qwen";
              benchmarks: { mmlu: 76; arena: 1300 };
          }
          | {
              name: "Qwen3 8B";
              repo: "unsloth/Qwen3-8B-GGUF";
              file: "Qwen3-8B-Q4_K_M.gguf";
              parameters: "8B";
              quantization: "Q4_K_M";
              contextLength: 32768;
              languages: readonly [
                  "en",
                  "zh",
                  "de",
                  "fr",
                  "es",
                  "pt",
                  "it",
                  "nl",
                  "pl",
                  "ru",
                  "ja",
                  "ko",
              ];
              description: "Thinking mode, excellent multilingual, ~5GB RAM";
              thinkingMode: "qwen";
              benchmarks: { mmlu: 81; arena: 1350 };
          }
          | {
              name: "Qwen3 14B";
              repo: "unsloth/Qwen3-14B-GGUF";
              file: "Qwen3-14B-Q4_K_M.gguf";
              parameters: "14B";
              quantization: "Q4_K_M";
              contextLength: 32768;
              languages: readonly [
                  "en",
                  "zh",
                  "de",
                  "fr",
                  "es",
                  "pt",
                  "it",
                  "nl",
                  "pl",
                  "ru",
                  "ja",
                  "ko",
              ];
              description: "Thinking mode, top multilingual, ~9GB RAM";
              thinkingMode: "qwen";
              benchmarks: { mmlu: 84; arena: 1380 };
          }
          | {
              name: "Qwen 2.5 Coder 7B";
              repo: "bartowski/Qwen2.5-Coder-7B-Instruct-GGUF";
              file: "Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf";
              parameters: "7B";
              quantization: "Q4_K_M";
              contextLength: 131072;
              languages: readonly ["en"];
              description: "Optimized for code generation";
              benchmarks: { mmlu: 66; arena: 1250 };
          }
          | {
              name: "DeepSeek R1 Distill 7B";
              repo: "bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF";
              file: "DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf";
              parameters: "7B";
              quantization: "Q4_K_M";
              contextLength: 131072;
              languages: readonly ["en", "zh"];
              description: "Strong reasoning with chain-of-thought";
              thinkingMode: "deepseek";
              benchmarks: { mmlu: 72; arena: 1300 };
          }
          | {
              name: "DeepSeek R1 Distill 14B";
              repo: "bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF";
              file: "DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf";
              parameters: "14B";
              quantization: "Q4_K_M";
              contextLength: 131072;
              languages: readonly ["en", "zh"];
              description: "Best reasoning model, shows thinking";
              thinkingMode: "deepseek";
              benchmarks: { mmlu: 79; arena: 1350 };
          }
          | {
              name: string;
              repo: string;
              file: string;
              parameters: string;
              quantization: string;
              contextLength: number;
              languages: string[];
              description: string;
          }

      Model information object
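
      The page does not show this accessor's name (`getModelInfo()` below is an assumed name). A small formatter over the fields common to every branch of the return union above:

```typescript
// Fields shared by every branch of the return union above.
interface ModelInfo {
  name: string
  parameters: string
  quantization: string
  contextLength: number
  languages: readonly string[]
}

// Pure helper: one-line summary of a model-info object.
function summarize(info: ModelInfo): string {
  return `${info.name} (${info.parameters}, ${info.quantization}, ` +
    `${info.contextLength} ctx, ${info.languages.length} languages)`
}

// Hypothetical usage; `getModelInfo` is an assumed accessor name:
// console.log(summarize(engine.getModelInfo()))
```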

    • Reset the chat session

      Clears all conversation history, starting fresh for new conversations. The model remains loaded; use dispose() to fully unload.

      Returns void
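
      The reset method's name is not shown on this page (`resetChat()` below is an assumed name). A sketch of the documented pattern: answer unrelated questions with the model kept loaded, clearing history between them so turns do not contaminate each other:

```typescript
// Structural sketch of the documented session API; `resetChat` is an
// assumed name for the reset method described above.
interface SessionEngine {
  chat(messages: { role: "system" | "user" | "assistant"; content: string }[]): Promise<{ text: string }>
  resetChat(): void
}

// Answer independent questions, resetting history between turns.
async function answerIndependently(engine: SessionEngine, questions: string[]): Promise<string[]> {
  const answers: string[] = []
  for (const q of questions) {
    const result = await engine.chat([{ role: "user", content: q }])
    answers.push(result.text)
    engine.resetChat() // clears history only; model stays loaded
  }
  return answers
}
```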

    • dispose(): Clean up resources and unload the model

      Releases GPU memory and cleans up native resources. Call this when done with the engine to prevent memory leaks.

      Returns Promise<void>

      const engine = new LLMEngine({ model: "gemma-3n-e4b" })
      try {
        await engine.initialize()
        const result = await engine.generate({ prompt: "Hello" })
      } finally {
        await engine.dispose()
      }
    • Static listModels(): List all available curated models

      Returns (
          { id: string } & (
              | {
                  name: "Gemma 3n E2B";
                  repo: "unsloth/gemma-3n-E2B-it-GGUF";
                  file: "gemma-3n-E2B-it-Q4_K_M.gguf";
                  parameters: "5B→2B";
                  quantization: "Q4_K_M";
                  contextLength: 32768;
                  languages: readonly [
                      "en",
                      "de",
                      "fr",
                      "es",
                      "it",
                      "pt",
                      "nl",
                      "pl",
                      "ru",
                      "ja",
                      "ko",
                      "zh",
                  ];
                  description: "Ultra-efficient edge model, ~2GB RAM";
                  requiresAuth: false;
                  benchmarks: { mmlu: 64; arena: 1250 };
              }
              | {
                  name: "Gemma 3n E4B";
                  repo: "unsloth/gemma-3n-E4B-it-GGUF";
                  file: "gemma-3n-E4B-it-Q4_K_M.gguf";
                  parameters: "8B→4B";
                  quantization: "Q4_K_M";
                  contextLength: 32768;
                  languages: readonly [
                      "en",
                      "de",
                      "fr",
                      "es",
                      "it",
                      "pt",
                      "nl",
                      "pl",
                      "ru",
                      "ja",
                      "ko",
                      "zh",
                  ];
                  description: "Best edge model, ~3GB RAM";
                  requiresAuth: false;
                  benchmarks: { mmlu: 75; arena: 1300 };
              }
              | {
                  name: "Gemma 3 27B";
                  repo: "unsloth/gemma-3-27b-it-GGUF";
                  file: "gemma-3-27b-it-Q4_K_M.gguf";
                  parameters: "27B";
                  quantization: "Q4_K_M";
                  contextLength: 131072;
                  languages: readonly [
                      "en",
                      "de",
                      "fr",
                      "es",
                      "it",
                      "pt",
                      "nl",
                      "pl",
                      "ru",
                      "ja",
                      "ko",
                      "zh",
                  ];
                  description: "Maximum quality, 128K context, ~18GB RAM";
                  benchmarks: { mmlu: 77; arena: 1338 };
              }
              | {
                  name: "GPT-OSS 20B";
                  repo: "unsloth/gpt-oss-20b-GGUF";
                  file: "gpt-oss-20b-Q4_K_M.gguf";
                  parameters: "21B (3.6B active)";
                  quantization: "Q4_K_M";
                  contextLength: 131072;
                  languages: readonly ["en"];
                  description: "OpenAI's open model, MoE, ~16GB RAM";
                  benchmarks: { mmlu: 82; arena: 1340 };
              }
              | {
                  name: "Phi-4 14B";
                  repo: "bartowski/phi-4-GGUF";
                  file: "phi-4-Q4_K_M.gguf";
                  parameters: "14B";
                  quantization: "Q4_K_M";
                  contextLength: 16384;
                  languages: readonly ["en"];
                  description: "Microsoft's reasoning-focused, excellent for STEM";
                  benchmarks: { mmlu: 84; arena: 1320 };
              }
              | {
                  name: "Qwen3 4B";
                  repo: "unsloth/Qwen3-4B-GGUF";
                  file: "Qwen3-4B-Q4_K_M.gguf";
                  parameters: "4B";
                  quantization: "Q4_K_M";
                  contextLength: 32768;
                  languages: readonly [
                      "en",
                      "zh",
                      "de",
                      "fr",
                      "es",
                      "pt",
                      "it",
                      "nl",
                      "pl",
                      "ru",
                      "ja",
                      "ko",
                  ];
                  description: "Thinking mode, 100+ languages, ~3GB RAM";
                  thinkingMode: "qwen";
                  benchmarks: { mmlu: 76; arena: 1300 };
              }
              | {
                  name: "Qwen3 8B";
                  repo: "unsloth/Qwen3-8B-GGUF";
                  file: "Qwen3-8B-Q4_K_M.gguf";
                  parameters: "8B";
                  quantization: "Q4_K_M";
                  contextLength: 32768;
                  languages: readonly [
                      "en",
                      "zh",
                      "de",
                      "fr",
                      "es",
                      "pt",
                      "it",
                      "nl",
                      "pl",
                      "ru",
                      "ja",
                      "ko",
                  ];
                  description: "Thinking mode, excellent multilingual, ~5GB RAM";
                  thinkingMode: "qwen";
                  benchmarks: { mmlu: 81; arena: 1350 };
              }
              | {
                  name: "Qwen3 14B";
                  repo: "unsloth/Qwen3-14B-GGUF";
                  file: "Qwen3-14B-Q4_K_M.gguf";
                  parameters: "14B";
                  quantization: "Q4_K_M";
                  contextLength: 32768;
                  languages: readonly [
                      "en",
                      "zh",
                      "de",
                      "fr",
                      "es",
                      "pt",
                      "it",
                      "nl",
                      "pl",
                      "ru",
                      "ja",
                      "ko",
                  ];
                  description: "Thinking mode, top multilingual, ~9GB RAM";
                  thinkingMode: "qwen";
                  benchmarks: { mmlu: 84; arena: 1380 };
              }
              | {
                  name: "Qwen 2.5 Coder 7B";
                  repo: "bartowski/Qwen2.5-Coder-7B-Instruct-GGUF";
                  file: "Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf";
                  parameters: "7B";
                  quantization: "Q4_K_M";
                  contextLength: 131072;
                  languages: readonly ["en"];
                  description: "Optimized for code generation";
                  benchmarks: { mmlu: 66; arena: 1250 };
              }
              | {
                  name: "DeepSeek R1 Distill 7B";
                  repo: "bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF";
                  file: "DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf";
                  parameters: "7B";
                  quantization: "Q4_K_M";
                  contextLength: 131072;
                  languages: readonly ["en", "zh"];
                  description: "Strong reasoning with chain-of-thought";
                  thinkingMode: "deepseek";
                  benchmarks: { mmlu: 72; arena: 1300 };
              }
              | {
                  name: "DeepSeek R1 Distill 14B";
                  repo: "bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF";
                  file: "DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf";
                  parameters: "14B";
                  quantization: "Q4_K_M";
                  contextLength: 131072;
                  languages: readonly ["en", "zh"];
                  description: "Best reasoning model, shows thinking";
                  thinkingMode: "deepseek";
                  benchmarks: { mmlu: 79; arena: 1350 };
              }
          )
      )[]

      Array of model information objects

      const models = LLMEngine.listModels()
      models.forEach(m => console.log(`${m.id}: ${m.name} (${m.parameters})`))
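
      Building on the example above, the returned objects can be filtered on any field of the element type, for instance context length. A small sketch:

```typescript
// Pure predicate over the element type shown above: models advertising
// a 128K-token (131072) context window.
function isLongContext(m: { contextLength: number }): boolean {
  return m.contextLength >= 131072
}

// Usage against the real curated list:
// const longCtx = LLMEngine.listModels().filter(isLongContext)
// longCtx.forEach(m => console.log(`${m.id} (${m.contextLength} ctx)`))
```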
    • Static getModelForUseCase(useCase): Get the recommended model for a specific use case

      Parameters

      • useCase:
            | "fast"
            | "balanced"
            | "quality"
            | "edge"
            | "multilingual"
            | "reasoning"
            | "code"
            | "longContext"

        One of: fast, balanced, quality, edge, multilingual, reasoning, code, longContext

      Returns
          | "gemma-3n-e2b"
          | "gemma-3n-e4b"
          | "gemma-3-27b"
          | "gpt-oss-20b"
          | "phi-4"
          | "qwen3-4b"
          | "qwen3-8b"
          | "qwen3-14b"
          | "qwen-2.5-coder-7b"
          | "deepseek-r1-7b"
          | "deepseek-r1-14b"

      Model ID string

      const modelId = LLMEngine.getModelForUseCase("code")
      const engine = new LLMEngine({ model: modelId })