Claude Code running with devstral-128k via Ollama

I Tested 18 Local Models So You Don’t Have To

Ollama released Anthropic API compatibility in January 2026, so I tested 18 local models with Claude Code to find out which ones actually work for agentic coding tasks.

TL;DR

devstral-small-2:24b is the winner - best quality, fastest, zero interventions

You MUST configure context window - Ollama defaults to 4K; use 64K minimum

Expect 12-24 min for tasks that take ~2 min with Opus 4.5 - but it works!

Ollama docs: https://docs.ollama.com/integrations/claude-code
Anthropic API compatibility: https://docs.ollama.com/api/anthropic-compatibility

My Setup

Spec	Value
Machine	MacBook Pro
Chip	Apple M4 Pro
RAM	48 GB unified memory
Ollama	v0.14.2

Models

Here’s everything I tested, sorted by size:

Model	Size	Release	SWE-bench	Type
nemotron-3-nano:30b	24GB	Dec 2025	-	MoE
cogito:32b	20GB	Jul 2025	-	Hybrid reasoning
granite4:32b-a9b-h	~20GB	Oct 2025	-	General-purpose
command-r:35b	19GB	Mar 2024	-	RAG-optimized
qwen2.5-coder:32b	19GB	Nov 2024	9.0%	Coding
deepseek-r1:32b	19GB	Jan 2025	41.4%	Reasoning
qwen3-coder:30b	18GB	Jul 2025	51.6%	Coding
qwen3:30b	18GB	Apr 2025	-	General-purpose
devstral-small-2:24b	15GB	Dec 2025	68.0%	Agentic coding
mistral-small3.2:24b	15GB	Jun 2025	-	General-purpose
magistral:24b	14GB	Jun 2025	-	Reasoning
gpt-oss:20b	14GB	Aug 2025	-	General-purpose
cogito:14b	9GB	Jul 2025	-	Hybrid reasoning
deepseek-coder-v2:16b	8.9GB	Jun 2024	-	Coding (no tools)
rnj-1:8b	5.1GB	Dec 2025	20.8%	General-purpose
phi4-mini:3.8b	2.5GB	Feb 2025	-	General-purpose
granite4:3b	2.1GB	Oct 2025	-	General-purpose
functiongemma:270m	301MB	Dec 2025	-	Function calling

Experiments

I chose a very simple task: run /init on a repo (jupyterlab-latex) to generate CLAUDE.md, which is normally the first thing I do in a new repo. It’s deceptively hard though - the model has to discover tools, explore multiple files, and synthesize documentation without hallucinating. One or two runs per model; treat results as field notes.

My first two models (nemotron, gpt-oss) used Ollama’s default context window - which is how I discovered the 4K limit issue. After that, I set context to 64K+ in Ollama’s settings.

`nemotron-3-nano:30b`

My first attempt revealed a critical failure mode. With the default context window, the model’s thinking block explicitly shows it decided to skip reading files entirely:

“We don’t have details of repo… There haven’t been any reads yet… Let’s assume typical repo structure”

Instead of using tools to explore, it fabricated an entire codebase structure. The output described a React/Node.js monorepo with /frontend and /backend directories - neither of which exist in jupyterlab-latex (a Python/TypeScript JupyterLab extension). It invented commands like npm run dev and referenced non-existent config files.

This failure led me to discover Ollama’s default 4K context limit. After configuring a 128K context window, subsequent attempts worked much better:

Read → Glob → Read → Read → Read → Read → Glob → Read → Write

The model properly explored the codebase, but still stopped mid-task and required a follow-up prompt (“Continue”) to finish. Final output was accurate and high quality - proving the model can work, but context configuration is critical.

`gpt-oss:20b`

Also tested early with the default context window. Fast but unreliable:

Direct prompt: Finished quickly but low quality output
/init skill: Tool parameter errors, empty results, needed intervention

Sautéed for 2m 37s  (Claude Code's task timer)

`devstral-small-2:24b` ⭐ Winner

With 128K context configured from the start, this was a perfect run. The model immediately understood the task:

“I’ll analyze this codebase and create a CLAUDE.md file with the essential information for future instances.”

Tool call sequence shows direct, confident tool usage:

Bash → Bash → Bash → Read → Bash → Bash → Bash → Read → Read → Read → Bash → Write

No confusion about subagents or tool parameters - it went straight for Bash and Read to explore the codebase, then used Write to create the output.

The output was 180 lines of documentation with actual function names, Python config examples, and a 5-step communication flow diagram. Every file reference checked out - no hallucinations.

Why did devstral outperform? Mistral trained it specifically for SWE-Bench (68.0% score) and tool-use scenarios. You can see it in the tool calls - direct and confident, no subagent confusion.

Sautéed for 17m 12s

`qwen3-coder:30b`

Also configured with 128K context. The model’s first instinct was to delegate to a subagent. From the session trace, it tried to spawn an Explore agent twice:

{
  "description": "Explore codebase structure",
  "prompt": "Explore the structure of this JupyterLab LaTeX extension repository...",
  "subagent_type": "Explore"
}

This isn’t an Ollama bug, but a mismatch between what Claude Code can do in a given environment and what the model decides to attempt. Claude Code has a notion of subagents (like an “Explore” helper), but in my setup those weren’t available/configured, so that tool call fails. Ollama’s docs do advertise Claude Code usage, though, so it’s worth calling out explicitly: with third-party models, you should expect occasional “tooling weirdness” like this even if the transport API is compatible.

When the Task tool failed (subagents weren’t configured), qwen3-coder adapted gracefully. Tool sequence shows the recovery:

Task → Task → Bash → Read → Read → Read → Read → Read → Read → Read → Read → Write

After two failed Explore attempts, it switched to direct Bash and Read tools and completed the task without further intervention. Output quality was good - accurate, no hallucinations, but less detailed than devstral (86 lines vs 180).

Sautéed for 23m 48s

`granite4:32b-a9b-h`

An interesting comparison point - this is IBM’s general-purpose 32B model, not a coding specialist. With 128K context configured, it completed the task in under 7 minutes - the fastest successful run.

The trade-off: minimal exploration. Tool sequence:

Read → Write

Just two tool calls - read the README, write CLAUDE.md. No codebase exploration, no package.json check, no architecture analysis. The output was decent:

✅ Correct project type (JupyterLab LaTeX extension)
✅ Correct commands (jlpm run build, jlpm run watch)
✅ Mermaid architecture diagram
⚠️ Some hallucinated details (referenced src/components/Toolbar.tsx without verifying it exists)

At 32K context, it stalled - started correctly (Glob → Read), but got stuck after reading files and never produced output. A different failure mode than devstral’s 32K hallucination.

Verdict: Works, but lazy. General-purpose models can complete agentic tasks but tend to “wing it” with minimal tool use, while coding specialists explore more thoroughly.

Sautéed for ~7m

`qwen3:30b`

The general-purpose Qwen3 (not the coder variant). This was the worst performer - pure hallucination with zero exploration.

Tool sequence:

Write

Just one tool call. The thinking block is revealing - it explicitly acknowledged it couldn’t see files but proceeded anyway:

“Since I can’t actually see the files, I’ll have to rely on the context provided.”

It inferred file structure from git status in the system prompt, then fabricated everything:

❌ python jupyterlab_latex/build.py - wrong command (should be jlpm run build)
❌ latex_cleanup.py - fabricated filename
❌ flake8 - assumed linter without checking

At 128K context, it consumed 31GB RAM (vs 18GB on disk) - pushing my 48GB system into swap. The memory pressure may have contributed to its laziness, but the thinking block shows it consciously chose to guess rather than explore.

Key finding: The coder fine-tuning isn’t just about coding knowledge - it teaches the model to actually use tools instead of guessing. qwen3-coder explored properly; qwen3 base hallucinated everything.

Sautéed for ~5m

`qwen2.5-coder:32b`

Failed. Despite having 128K context configured, through multiple attempts it kept reaching for the Explore subagent tool and then abruptly stopping without completing any work. Unlike qwen3-coder which recovered when Explore failed, qwen2.5-coder couldn’t adapt. Same model family, different generation, completely different behavior when things go wrong.

`mistral-small3.2:24b`

Failed - hallucinated tool parameters. This model understands it should use tools but invents wrong parameter schemas. From the session trace, it tried to call the Task tool with made-up parameters:

// Attempt 1:
{"instruction": "...", "max_depth": 100}

// Attempt 2:
{"subagent_name": "Explore", "subagent_type": "Explore", "subagent_prompt": "..."}

The actual required parameters are description and prompt. When it received clear error messages explaining this, it simply repeated “I’m going to use the Task tool…” and stopped - unable to self-correct.

This is a different failure mode than hallucinating content (qwen3) or refusing (functiongemma). The model has learned about tools but not the actual invocation format. Worth noting: devstral-small-2 is also a Mistral model and works perfectly - the difference is devstral’s agentic specialization.

Memory: 37GB loaded at 128K context (vs 15GB on disk).

`magistral:24b`

Failed - narrated tools instead of invoking them. This new Mistral reasoning model understood the task and knew which tools to use, but wrote out tool calls as text instead of actually executing them:

"Let me use the Glob tool to find these patterns:

```bash
Glob pattern: **/README.md
Glob pattern: .github/readme*
...
```

Now that I have the relevant files, let's analyze..."

Zero actual tool calls were made. The model described what it would do, assumed the tools had run, and proceeded to the next step. This suggests training on tool documentation without actual tool-use interactions.

Memory: 23GB loaded at 128K context (vs 14GB on disk).

Native context limitation: magistral’s native context is only 39K. Even with Ollama allocating 128K, the model may not effectively use context beyond its training limit - which could explain why it never received the tool invocation format.

`cogito:32b`

Failed - memory issues and context-limited stall. This hybrid reasoning model has different failure modes depending on context configuration:

At 128K context: Loaded 64GB into memory (41% CPU / 59% GPU split). On my 48GB system, this caused severe memory thrashing - spiky memory pressure, swap usage, and zero tokens produced after 5+ minutes.

At 64K context: Loaded 42GB (8% CPU / 92% GPU). Still tight but runnable. Same stalling behavior.

At 32K context: Loaded 30GB (100% GPU). Actually started working! Made correct Glob and Read calls, explored the codebase properly:

Glob → Read README.md → "Let me create a todo list..."

But then it just… stopped. Said “Let me start with writing the overview section first” and ended without writing anything. Even nudging with “continue” prompt didn’t help - completely stuck.

This is the same pattern as granite4:32b at 32K context: can explore but can’t complete. 32K context is insufficient for task completion - the model loses track of the goal mid-execution.

`cogito:14b`

Failed - multiple tool issues. Testing the smaller cogito variant to see if the 7-15B range had any surprises. It did, but not good ones.

Memory: Even at 9GB on disk, loaded 45GB at 128K context with 15% CPU offload. At 64K context it was more manageable.

Tool sequence shows multiple failure modes:

Read README.md ✅ → Read copilot-instructions.md ✅ (not found) →
WebSearch ❌ (hallucinated) → TodoWrite ❌ (wrong params, twice) →
Printed CLAUDE.md as text ⚠️

Hallucinated WebSearch - tool doesn’t exist in Claude Code, got empty results
Wrong TodoWrite params - missing required activeForm field, tried twice without learning
Never used Write tool - just printed the CLAUDE.md content as markdown text instead of writing to file

The generated content was actually reasonable - correct commands, accurate architecture. But the model “completed” the task by printing output rather than writing the file. It understood the goal but couldn’t execute properly.

Time: ~7.7 minutes

The cogito family (both 32b and 14b) consistently fails with Claude Code’s tool schemas - different sizes, different failure modes, same outcome.

`command-r:35b`

Failed - nested tool parameter schema. The last untested model in the viable 15-35B range. At 128K context it didn’t fit on my GPU. At 64K and 32K it loaded but failed with the same tool schema issue.

From the trace, the model wrapped all tool parameters in a nested structure:

{
  "tool_name": "Task",
  "parameters": {
    "description": "...",
    "prompt": "...",
    "subagent_type": "general-purpose"
  }
}

The correct format is flat parameters at the top level. It made 4 tool calls (3 Task, 1 TodoWrite) - all failed with validation errors like “required parameter description is missing” because the nesting caused parameters to be undefined at the expected level.

Unlike mistral-small3.2 which invented wrong parameter names, command-r uses the correct parameter names but wraps them incorrectly. When it received validation errors, it didn’t retry - just output a text-based “Action Plan” and stopped.

This suggests Cohere’s tool-calling format differs from the Anthropic API schema. The model was trained on a different tool invocation structure.

Context comparison:

32K: 4 tool calls, all failed, gave up quickly (~7 min)
64K: 29 tool calls, all failed, kept retrying same broken schema (~9.5 min)

More context didn’t help - it just gave the model more runway to keep failing the same way. It never learned from the error messages.

Results

✅ Worked

Model	Quality	Time	Notes
devstral-small-2 ⭐	Excellent	17 min	No hallucinations, no interventions
qwen3-coder	Good	24 min	Recovered after Explore failed
granite4:32b	Good	~7 min	Fast but lazy, minor hallucinations*

⚠️ Completed With Issues

Model	Quality	Time	Issue
gpt-oss:20b	Low	~3 min	Needed intervention
nemotron-3-nano	Mixed	-	Hallucinated on first attempt
qwen3:30b	Poor	~5 min	Zero tool calls, fabricated everything

❌ Failed

Model	Time	Failure Mode
qwen2.5-coder:32b	-	Stuck on Explore subagent
mistral-small3.2:24b	-	Wrong tool parameter schema
magistral:24b	-	Narrated tools instead of invoking
cogito:32b	-	Memory thrashing, context stall
cogito:14b	~8 min	Hallucinated WebSearch tool
command-r:35b	7-10 min	Nested tool parameters
deepseek-r1:32b	-	No tool support in Ollama
deepseek-coder-v2:16b	-	No tool support in Ollama
functiongemma:270m	-	Refuses everything
granite4:3b	-	Hallucinates without tools
phi4-mini:3.8b	-	Invents fake tool names
rnj-1:8b	-	Silent, zero output

*granite4:32b referenced files it never verified existed. It “works” in the sense that it completes the task and produces usable output, but you’d want to review it before trusting it. devstral and qwen3-coder are trustworthy out of the box.

Winner: devstral-small-2 - best quality, smallest footprint, zero interventions.

Model Outputs

Compare the actual CLAUDE.md files generated by each model. Use the tabs to switch between models, or click the side-by-side button to compare them directly:

devstral-small-2

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is jupyterlab-latex, a JupyterLab extension for live-editing of LaTeX documents. It provides:

Real-time LaTeX compilation and PDF preview
SyncTeX support for editor-PDF synchronization
Writing tools (text formatting, lists, tables, plots)
Mathematical constants and symbols insertion

The extension consists of:

Frontend (TypeScript): JupyterLab UI components and plugins
Backend (Python): Server extension for LaTeX compilation

Development Setup

Prerequisites

Node.js (for building TypeScript)
Python 3.8+
JupyterLab 4.0+
LaTeX distribution (e.g., TeX Live, MiKTeX) with pdflatex, xelatex, or tectonic

Installation

# Development install
pip install -e .

# Link with JupyterLab
jupyter labextension develop . --overwrite
jupyter server extension enable jupyterlab_latex

# Build TypeScript
jlpm run build

Building

jlpm run build: Build TypeScript and labextension
jlpm run build:prod: Production build (clean + minified)
jlpm run watch: Watch mode for development
jupyter labextension build .: Build labextension

Linting

jlpm run lint: Run ESLint, Prettier, and Stylelint
jlpm run eslint: Run ESLint only
jlpm run prettier: Run Prettier formatting
jlpm run stylelint: Run Stylelint

Cleaning

jlpm run clean: Clean build artifacts
jlpm run clean:all: Clean everything (lib, labextension, lint cache)

Architecture

Frontend Structure (TypeScript)

Location: src/

index.ts: Main plugin entry point
- Defines JupyterFrontEndPlugin for LaTeX extension
- Sets up commands, toolbar buttons, and context menu items
- Manages PDF preview and SyncTeX synchronization
pdf.ts: PDF viewer components
- PDFJSViewer: PDF rendering widget
- PDFJSViewerFactory: Document registry factory
- PDFJSDocumentWidget: Widget wrapper
error.tsx: Error panel component for LaTeX compilation errors
pagenumber.tsx: Page number display component

Backend Structure (Python)

Location: jupyterlab_latex/

__init__.py: Server extension entry point
- Registers /latex/build and /latex/synctex handlers
- Sets up Tornado web handlers
build.py: LaTeX compilation handler
- LatexBuildHandler: Main API handler for compilation
- latex_cleanup(): Context manager for file cleanup
- build_tex_cmd_sequence(): Builds LaTeX command sequences
- filter_output(): Filters LaTeX warnings/errors
- run_latex(): Executes LaTeX commands
synctex.py: SyncTeX handler for editor-PDF synchronization
config.py: Configuration schema
- LatexConfig: Configuration class with settings for:
  - latex_command: LaTeX engine (xelatex, pdflatex, tectonic)
  - bib_command: BibTeX command
  - run_times: Number of compilation passes
  - shell_escape: Security setting (restricted/allow/disallow)
  - cleanup: Whether to clean intermediate files
  - manual_cmd_args: Custom command arguments
  - disable_bibtex: Skip BibTeX compilation
util.py: Utility functions
- run_command(): Execute shell commands with async support

Key Features

Live Preview: Opens PDF alongside .tex editor, auto-updates on save
SyncTeX: Click in PDF to jump to source, click in source to jump to PDF
Toolbar: Adds buttons for formatting (bold, italic), sub/superscripts, lists, tables, plots
Menu: Constants (π, φ) and symbols (≤, ≥, ∈) insertion
Error Handling: Shows compilation errors in a dedicated panel

Communication Flow

User saves .tex file → fileChanged signal emitted
Frontend calls /latex/build API endpoint
Backend runs LaTeX compilation
PDF is generated/updated → displayed in PDF viewer
SyncTeX data generated for navigation

Configuration

Users can configure the extension via jupyter_notebook_config.py:

# Use pdflatex instead of xelatex
c.LatexConfig.latex_command = 'pdflatex'

# Custom compilation command
c.LatexConfig.manual_cmd_args = [
    'lualatex',
    '-interaction=nonstopmode',
    '-shell-escape',
    '{filename}.tex'
]

# Disable BibTeX compilation
c.LatexConfig.disable_bibtex = True

# Clean intermediate files
c.LatexConfig.cleanup = True

Testing Approach

The project uses:

TypeScript with ESLint for frontend
Python for backend
No dedicated test suite found in repository

Important Files

package.json: Frontend dependencies and scripts
pyproject.toml: Python package configuration
schema/plugin.json: JupyterLab settings schema
README.md: User documentation
CHANGELOG.md: Release notes

Common Development Tasks

Adding a new toolbar button:
- Create button in EditorToolbarPanel class in src/index.ts
- Add icon SVG in style/icons/
- Register command with app.commands.addCommand()
Adding a new menu item:
- Add to addLatexMenu() function
- Create command handler
Modifying LaTeX compilation:
- Update build_tex_cmd_sequence() in jupyterlab_latex/build.py
- Ensure SyncTeX flags are preserved
Adding new symbols/constants:
- Extend the constants or symbols maps in addLatexMenu()

Notes

The extension uses PDF.js for rendering PDFs in the browser
SyncTeX requires .synctex.gz files generated by LaTeX
File cleanup is optional and controlled by cleanup config
Manual command arguments support {filename} placeholder

qwen3-coder

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

This is the JupyterLab LaTeX extension, which provides live editing of LaTeX documents within JupyterLab. It allows users to compile and preview LaTeX documents in real-time, with features like SyncTeX support for bidirectional navigation between source and preview.

Key Architecture Components

Frontend (TypeScript/JavaScript)

The extension is built using JupyterLab's extension system
Core frontend functionality is in src/index.ts
Uses JupyterLab's application framework, widget system, and services
Implements a PDF viewer using pdfjs-dist for displaying compiled LaTeX output
Provides toolbar buttons and context menu items for LaTeX editing tools
Implements SyncTeX functionality for bidirectional navigation between editor and PDF

Backend (Python)

Server extension in jupyterlab_latex/ directory
Handles LaTeX compilation through the Jupyter Server API
Uses tornado.process.Subprocess for executing LaTeX commands
Provides API endpoints at /latex/build and /latex/synctex
Configuration is handled through jupyterlab_latex/config.py using traitlets

Build System

Uses TypeScript for frontend with tsc compiler
Uses JupyterLab's build system with @jupyterlab/builder
Uses yarn for package management
Builds both frontend and backend extensions

Development Setup

To develop this extension, you need:

NodeJS for building the frontend
Python 3.8+ for the server extension
JupyterLab 4.0+ for running the development environment

Development commands:

jlpm install - Install dependencies
jlpm build - Build the extension
jlpm watch - Watch for changes and rebuild automatically
jupyter labextension develop . --overwrite - Link development version
jupyter server extension enable jupyterlab_latex - Enable server extension

Key Files and Directories

src/index.ts - Main extension entry point
jupyterlab_latex/__init__.py - Extension initialization
jupyterlab_latex/build.py - LaTeX compilation handler
jupyterlab_latex/config.py - Configuration handling
jupyterlab_latex/util.py - Utility functions for running commands
package.json - Extension metadata and build scripts
pyproject.toml - Python package configuration
README.md - Documentation and usage instructions

Key Features

Real-time LaTeX compilation and preview
SyncTeX support for bidirectional navigation between source and PDF
Toolbar buttons for common LaTeX editing tasks (subscript, superscript, lists, etc.)
Context menu integration for preview and SyncTeX actions
Support for multiple LaTeX engines (xelatex, pdflatex, tectonic)
Configuration options for customizing compilation commands
BibTeX support for bibliography compilation
Error handling and display of compilation errors

Testing

The extension can be tested by:

Installing in development mode
Opening a .tex file in JupyterLab
Using the preview toolbar button or context menu
Verifying that LaTeX compilation works and PDF is generated
Testing SyncTeX functionality by clicking in PDF and editor

Common Development Tasks

Adding new toolbar buttons or menu items
Modifying LaTeX compilation commands or options
Enhancing error handling or display
Adding new LaTeX editing features
Improving SyncTeX integration
Customizing the extension's appearance or behavior

granite4-32b

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Common Development Commands

Build: jlpm run build - Compile the TypeScript source and generate JS bundles.
Watch Build: jlpm run watch - Continuously rebuild on file changes while you develop.
Run JupyterLab (with extension enabled): jupyter lab - Launch a local JupyterLab instance that loads this extension. The server extension must be installed with pip install -e . or conda install -c conda-forge jupyterlab-latex beforehand.
Lint: No dedicated lint step; TypeScript compilation (jlpm run build) will surface syntax errors. Ensure you have ESLint/TSLint configured in your editor if desired.
Run Tests (if any): The repository does not expose a test runner command. If unit tests are added later, they would typically be executed via pytest or the JupyterLab testing framework (jlpm run test).
Format Code: Use Prettier/TypeScript formatter configured in the project; simply save files after opening them to auto‑format.

High‑Level Architecture Overview

flowchart TD
    subgraph Frontend (LabExtension)
        A[LaTeX UI Components] -->|Provides toolbar, dialogs, preview panel|
        B[LitElement / React components] --> C[Preview iframe]
    end
    subgraph Backend (Server Extension)
        D[Python entrypoint: jupyterlab_latex]
        E[LatexConfig] -->|Customizes LaTeX command, shell escape|
        F[Bibtex Helper] -->|Runs bibtex if .bib files exist|
        G[Compile Runner] -->|Executes latex_command with arguments|
    end
    A -->|Sends compile request to| D
    D --> E
    D --> F
    D --> G

LabExtension (frontend): Provides the UI for LaTeX preview, toolbar buttons (subscript/superscript/bold/etc.), table creation dialog, and plot insertion. It registers a command latex:showPreview that triggers compilation.
Server Extension (backend): Implements the core logic:
- LatexConfig holds configuration values such as latex_command, run_times, disable_bibtex, etc., which can be overridden via JupyterLab's config system.
- When a compile request arrives, it builds an argument list (default: [latex_command, '-interaction=nonstopmode', '-halt-on-error', ... , '{filename}.tex']).
- It runs the LaTeX command in a subprocess to produce *.pdf. If .bib files are present and bibtex is enabled, it runs bibtex (or custom command) before recompiling.

Configuration: Customization via Jupyter config (jupyter_notebook_config.py):

c.LatexConfig.latex_command = 'pdflatex'      # or 'xelatex', 'lualatex'
c.LatexConfig.run_times   = 2                # multi‑pass for refs
c.LatexConfig.disable_bibtex = False         # enable bibtex by default

Security: The extension respects LaTeX's shell‑escape policy (c.LatexConfig.shell_escape). By default it is restricted; you can set to allow if needed.
Integration Points:
- Toolbar: Buttons call UI actions that emit events handled by the backend via Jupyter messages.
- Commands: Registered with Lab's command palette (latex:showPreview, etc.).
- Mime Renderers: The preview panel renders PDF output using an iframe or PDF.js viewer embedded in the frontend.

Tips for Development

Start a development server: conda env create -f environment.yml && conda activate jupyterlab-latex-env (or use pip/conda as described). Then run jlpm install followed by jlpm run watch and open another terminal with jupyter lab --watch. Any change in TypeScript will trigger a rebuild automatically.
Testing: Currently no test suite is shipped; add unit tests under tests/ using pytest if needed, e.g., pytest -q to run them.
Debugging LaTeX Errors: When compilation fails, the server writes logs to <notebook_dir>/latex.log. The frontend shows these in an error panel for user visibility.
Adding New Features:
- For additional toolbar icons or plot types, extend src/components/Toolbar.tsx and implement corresponding backend handlers that generate LaTeX code snippets.
- To support a new compilation engine (e.g., tectonic), modify LatexConfig.manual_cmd_args via config or expose an option UI in the frontend.
Version Compatibility: Ensure you are using JupyterLab ≥4.x and NodeJS 18+ to avoid breaking changes introduced after this extension's last release.

Resources & References (from README)

BibTeX handling – automatic if .bib files exist, configurable via disable_bibtex flag.
Multiple compile passes – set run_times = 2 for resolving references like \ref{}.
Custom compilation command – use manual_cmd_args to override default LaTeX invocation.
Security settings – control shell escape behavior with shell_escape.

This CLAUDE.md summarizes essential commands and architectural knowledge needed to be productive when working on the jupyterlab-latex repository.

nemotron-30b

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Common Development Commands

Build the extension:

 jlpm build:lib && jlpm build:labextension

For production build:

 jlpm clean && jlpm build:prod

Linting and formatting:

jlpm run lint:check        # Run eslint and stylelint checks
jlpm prettier              # Run prettier to format files
jlpm fix                   # Auto-fix linting issues

Watch mode for development:

jlpm watch                 # Watch src and rebuild on changes

Install the extension in development mode (after cloning):

pip install -e .
jupyter labextension develop . --overwrite
jupyter server extension enable jupyterlab_latex

Architecture Overview

The LaTeX extension consists of two primary components:

LaTeX Front‑end Plugin (latexPlugin) – registers commands, toolbar buttons, and context‑menu items for creating new .tex files, opening live previews, and handling SyncTeX mapping between editor and PDF.
PDFJS Renderer (pdfjsPlugin) – provides a viewer widget for rendering PDF files using PDF.js, tracks PDF widgets, and integrates with the notebook file browser.

Key concepts:

Uses JupyterFrontEnd's plugin system to add commands like latex:open-preview, latex:synctex-edit, and latex:synctex-view.
Leverages WidgetTracker and IPDFJSTracker for managing PDF widget lifecycle.
SyncTeX integration enables forward (editor → PDF) and reverse (PDF → editor) navigation via synctexEditRequest and synctexViewRequest functions.
Toolbar extensions add formatting shortcuts (subscript, superscript, fraction, alignment, list/tree generation, plot insertion) that operate on the current editor selection.

Configuration can be customized through jupyter_notebook_config.py, e.g., changing the LaTeX compilation command or disabling SyncTeX.

Testing & Verification

The project uses a CI pipeline defined in .github/workflows/build.yml. Steps include:
- Installing dependencies (python -m pip install .[test])
- Running lint checks (jlpm run lint:check)
- Building the extension and verifying server/labextension listings.
- Running python -m jupyterlab.browser_check for browser compatibility validation.
There is no dedicated unit‑test command; verification relies on manual preview testing in JupyterLab and automated CI checks.

Customization Points

Compilation Command – modify via c.LatexConfig.manual_cmd_args or c.LatexConfig.synctex_command.
Shell Escape Settings – control with c.LatexConfig.shell_escape.

These sections give a concise map of typical development tasks and the overall extension architecture for anyone (including future Claude instances) who needs to work with this repository.

Failure Modes

Testing revealed distinct ways models fail at agentic tasks:

Failure Mode	Example	Probable Cause
Refuses	functiongemma	Too conservative, confused by system prompts
Hallucinates content	qwen3:30b, granite4:3b	Skips tools, fabricates output
Hallucinates tools	phi4-mini	Invents non-existent tool names
Hallucinates params	mistral-small3.2	Knows tools exist, wrong schema
Narrates tools	magistral	Describes tools in text, never invokes
Stuck on subagent	qwen2.5-coder	Can’t adapt when Explore fails
Context stall	cogito:32b, granite4@32K	Explores correctly, stops mid-task
Nested params	command-r	Wraps params in {“tool_name”:X,“parameters”:{…}}
Silent	rnj-1:8b	Zero output, can’t process system prompts

The more sophisticated failures (wrong params, narration, nested params) suggest models trained on different tool-calling formats or documentation rather than actual Anthropic API interactions. Native context window also matters - magistral (39K native) failed even with 128K allocated.

How Local Models Compare to Cloud

SWE-bench Verified is what everyone uses to evaluate agentic coding - 500 real GitHub issues that models must solve. Here’s how local models compare to cloud:

Frontier Cloud Models (Proprietary)

Model	SWE-bench
Gemini 3 Flash	75-76%
Claude Opus 4.5	74-81%
GPT-5.2	72-75%
Claude Sonnet 4.5	70.6%
Claude Haiku 4.5	68.8%

Large Open Weights (Won’t fit 48GB)

Model	SWE-bench	Size
Devstral 2	72.2%	123B
Qwen3-Coder-480B	67%	480B
DeepSeek-V3.1	66%	671B

Local Models (Fits 48GB)

Model	SWE-bench	Result
devstral-small-2	68.0%	⭐ Winner
qwen3-coder:30b	51.6%	✅ Good
deepseek-r1:32b	41.4%	❌ No tools
qwen2.5-coder:32b	9.0%	❌ Stuck

The gap is surprisingly small. devstral-small-2 at 68% matches Claude Haiku 4.5 and trails Opus by only 6-8 points. A 24B model running locally keeps up with 100B+ models - turns out agentic training matters more than size.

SWE-bench score also predicts Claude Code success: models without published scores aren’t coding-focused and failed my tests.

Conclusions

Local models can do real agentic work now. devstral-small-2 completed the task reliably, with no hand-holding. It’s slower than cloud (17 min vs 2 min), but it runs on my laptop completely offline.

Key Takeaways

devstral-small-2 wins - best results, smallest footprint, built for this
The gap is smaller than I expected - 68% SWE-bench matches Haiku, trails Opus by 8 points
Context window matters - Ollama defaults to 4K; bump it to 64K or watch models hallucinate
SWE-bench predicts success - no published score usually means it won’t work
Speed hurts - 17-24 minutes vs 2 minutes on cloud
Check tool support first - not all models work with Ollama’s Anthropic API

What Works

devstral-small-2 and qwen3-coder both work reliably. The tool calling infrastructure is solid when the model supports it. Ollama 0.14.0 makes setup easy - no more LiteLLM translation layer.

What Doesn’t Work (Yet)

Most models can’t finish multi-step agentic tasks without help. Context overflow causes hallucinations (fabricated URLs, wrong repo names). And 8-12x slower than cloud is hard to ignore.

Critical: Set Context to 64K+

Ollama defaults to 4K context regardless of what model cards advertise. Claude Code’s system prompts overflow this, causing silent failures or hallucinations.

Ollama settings showing context length slider

Context	Result
4-16K	❌ Zero tool calls
32K	⚠️ Starts fine, then hallucinates
64K+	✅ Works

Quick Start

# 1. Install Ollama 0.14.0+ and pull devstral
ollama pull devstral-small-2

# 2. Set context to 64K in Ollama settings (GUI slider)

# 3. Add alias to ~/.zshrc
alias claude-local='ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY=ollama CLAUDE_CODE_USE_BEDROCK=0 claude --model devstral-small-2'

# 4. Run it
source ~/.zshrc
claude-local

Which local models actually work with Claude Code on a 48GB MacBook Pro?

I Tested 18 Local Models So You Don’t Have To

My Setup

Models

Experiments

nemotron-3-nano:30b

gpt-oss:20b

devstral-small-2:24b ⭐ Winner

qwen3-coder:30b

granite4:32b-a9b-h

qwen3:30b

qwen2.5-coder:32b

mistral-small3.2:24b

magistral:24b

cogito:32b

cogito:14b

command-r:35b

Results

✅ Worked

⚠️ Completed With Issues

❌ Failed

Model Outputs

CLAUDE.md

Project Overview

Development Setup

Prerequisites

Installation

Building

Linting

Cleaning

Architecture

Frontend Structure (TypeScript)

Backend Structure (Python)

Key Features

Communication Flow

Configuration

Testing Approach

Important Files

Common Development Tasks

Notes

CLAUDE.md

Overview

Key Architecture Components

Frontend (TypeScript/JavaScript)

Backend (Python)

Build System

Development Setup

Key Files and Directories

Key Features

Testing

Common Development Tasks

CLAUDE.md

Common Development Commands

High‑Level Architecture Overview

Tips for Development

Resources & References (from README)

CLAUDE.md

Common Development Commands

Architecture Overview

Testing & Verification

Customization Points

Failure Modes

How Local Models Compare to Cloud

Conclusions

Key Takeaways

What Works

What Doesn’t Work (Yet)

Critical: Set Context to 64K+

Quick Start

`nemotron-3-nano:30b`

`gpt-oss:20b`

`devstral-small-2:24b` ⭐ Winner

`qwen3-coder:30b`

`granite4:32b-a9b-h`

`qwen3:30b`

`qwen2.5-coder:32b`

`mistral-small3.2:24b`

`magistral:24b`

`cogito:32b`

`cogito:14b`

`command-r:35b`