Which local models actually work with Claude Code on a 48GB MacBook Pro?

I Tested 18 Local Models So You Don’t Have To
Ollama released Anthropic API compatibility in January 2026, so I tested 18 local models with Claude Code to find out which ones actually work for agentic coding tasks.
TL;DR
- devstral-small-2:24b is the winner - best quality, fastest, zero interventions
- You MUST configure the context window - Ollama defaults to 4K; use 64K minimum
- Expect 12-24 min for tasks that take ~2 min with Opus 4.5 - but it works!
- Ollama docs: https://docs.ollama.com/integrations/claude-code
- Anthropic API compatibility: https://docs.ollama.com/api/anthropic-compatibility
My Setup
| Spec | Value |
|---|---|
| Machine | MacBook Pro |
| Chip | Apple M4 Pro |
| RAM | 48 GB unified memory |
| Ollama | v0.14.2 |
Models
Here’s everything I tested, sorted by size:
| Model | Size | Release | SWE-bench | Type |
|---|---|---|---|---|
| nemotron-3-nano:30b | 24GB | Dec 2025 | - | MoE |
| cogito:32b | 20GB | Jul 2025 | - | Hybrid reasoning |
| granite4:32b-a9b-h | ~20GB | Oct 2025 | - | General-purpose |
| command-r:35b | 19GB | Mar 2024 | - | RAG-optimized |
| qwen2.5-coder:32b | 19GB | Nov 2024 | 9.0% | Coding |
| deepseek-r1:32b | 19GB | Jan 2025 | 41.4% | Reasoning |
| qwen3-coder:30b | 18GB | Jul 2025 | 51.6% | Coding |
| qwen3:30b | 18GB | Apr 2025 | - | General-purpose |
| devstral-small-2:24b | 15GB | Dec 2025 | 68.0% | Agentic coding |
| mistral-small3.2:24b | 15GB | Jun 2025 | - | General-purpose |
| magistral:24b | 14GB | Jun 2025 | - | Reasoning |
| gpt-oss:20b | 14GB | Aug 2025 | - | General-purpose |
| cogito:14b | 9GB | Jul 2025 | - | Hybrid reasoning |
| deepseek-coder-v2:16b | 8.9GB | Jun 2024 | - | Coding (no tools) |
| rnj-1:8b | 5.1GB | Dec 2025 | 20.8% | General-purpose |
| phi4-mini:3.8b | 2.5GB | Feb 2025 | - | General-purpose |
| granite4:3b | 2.1GB | Oct 2025 | - | General-purpose |
| functiongemma:270m | 301MB | Dec 2025 | - | Function calling |
Experiments
I chose a very simple task: run /init on a repo (jupyterlab-latex) to generate CLAUDE.md, which is normally the first thing I do in a new repo. It’s deceptively hard though - the model has to discover tools, explore multiple files, and synthesize documentation without hallucinating. One or two runs per model; treat results as field notes.
My first two models (nemotron, gpt-oss) used Ollama’s default context window - which is how I discovered the 4K limit issue. After that, I set context to 64K+ in Ollama’s settings.
nemotron-3-nano:30b
My first attempt revealed a critical failure mode. With the default context window, the model’s thinking block explicitly shows it decided to skip reading files entirely:
“We don’t have details of repo… There haven’t been any reads yet… Let’s assume typical repo structure”
Instead of using tools to explore, it fabricated an entire codebase structure. The output described a React/Node.js monorepo with /frontend and /backend directories - neither of which exist in jupyterlab-latex (a Python/TypeScript JupyterLab extension). It invented commands like npm run dev and referenced non-existent config files.
This failure led me to discover Ollama’s default 4K context limit. After configuring a 128K context window, subsequent attempts worked much better:
Read → Glob → Read → Read → Read → Read → Glob → Read → Write
The model properly explored the codebase, but still stopped mid-task and required a follow-up prompt (“Continue”) to finish. Final output was accurate and high quality - proving the model can work, but context configuration is critical.
gpt-oss:20b
Also tested early with the default context window. Fast but unreliable:
- Direct prompt: Finished quickly but low quality output
- /init skill: Tool parameter errors, empty results, needed intervention
Sautéed for 2m 37s (Claude Code's task timer)
devstral-small-2:24b ⭐ Winner
With 128K context configured from the start, this was a perfect run. The model immediately understood the task:
“I’ll analyze this codebase and create a CLAUDE.md file with the essential information for future instances.”
Tool call sequence shows direct, confident tool usage:
Bash → Bash → Bash → Read → Bash → Bash → Bash → Read → Read → Read → Bash → Write
No confusion about subagents or tool parameters - it went straight for Bash and Read to explore the codebase, then used Write to create the output.
The output was 180 lines of documentation with actual function names, Python config examples, and a 5-step communication flow diagram. Every file reference checked out - no hallucinations.
Why did devstral outperform? Mistral trained it specifically for SWE-Bench (68.0% score) and tool-use scenarios. You can see it in the tool calls - direct and confident, no subagent confusion.
Sautéed for 17m 12s
qwen3-coder:30b
Also configured with 128K context. The model’s first instinct was to delegate to a subagent. From the session trace, it tried to spawn an Explore agent twice:
{
"description": "Explore codebase structure",
"prompt": "Explore the structure of this JupyterLab LaTeX extension repository...",
"subagent_type": "Explore"
}
This isn’t an Ollama bug, but a mismatch between what Claude Code can do in a given environment and what the model decides to attempt. Claude Code has a notion of subagents (like an “Explore” helper), but in my setup those weren’t available/configured, so that tool call fails. Ollama’s docs do advertise Claude Code usage, though, so it’s worth calling out explicitly: with third-party models, you should expect occasional “tooling weirdness” like this even if the transport API is compatible.
When the Task tool failed (subagents weren’t configured), qwen3-coder adapted gracefully. Tool sequence shows the recovery:
Task → Task → Bash → Read → Read → Read → Read → Read → Read → Read → Read → Write
After two failed Explore attempts, it switched to direct Bash and Read tools and completed the task without further intervention. Output quality was good - accurate, no hallucinations, but less detailed than devstral (86 lines vs 180).
Sautéed for 23m 48s
granite4:32b-a9b-h
An interesting comparison point - this is IBM’s general-purpose 32B model, not a coding specialist. With 128K context configured, it completed the task in under 7 minutes - the fastest successful run.
The trade-off: minimal exploration. Tool sequence:
Read → Write
Just two tool calls - read the README, write CLAUDE.md. No codebase exploration, no package.json check, no architecture analysis. The output was decent:
- ✅ Correct project type (JupyterLab LaTeX extension)
- ✅ Correct commands (`jlpm run build`, `jlpm run watch`)
- ✅ Mermaid architecture diagram
- ⚠️ Some hallucinated details (referenced `src/components/Toolbar.tsx` without verifying it exists)
At 32K context, it stalled - started correctly (Glob → Read), but got stuck after reading files and never produced output. A different failure mode than devstral’s 32K hallucination.
Verdict: Works, but lazy. General-purpose models can complete agentic tasks but tend to “wing it” with minimal tool use, while coding specialists explore more thoroughly.
Sautéed for ~7m
qwen3:30b
The general-purpose Qwen3 (not the coder variant). This was the worst performer - pure hallucination with zero exploration.
Tool sequence:
Write
Just one tool call. The thinking block is revealing - it explicitly acknowledged it couldn’t see files but proceeded anyway:
“Since I can’t actually see the files, I’ll have to rely on the context provided.”
It inferred file structure from git status in the system prompt, then fabricated everything:
- ❌ `python jupyterlab_latex/build.py` - wrong command (should be `jlpm run build`)
- ❌ `latex_cleanup.py` - fabricated filename
- ❌ `flake8` - assumed linter without checking
At 128K context, it consumed 31GB RAM (vs 18GB on disk) - pushing my 48GB system into swap. The memory pressure may have contributed to its laziness, but the thinking block shows it consciously chose to guess rather than explore.
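If you want to sanity-check memory numbers like these on your own machine, Ollama's CLI can report them directly (a quick aside, not part of the benchmark; `ollama stop` needs a reasonably recent Ollama release):

```bash
# Show currently loaded models with their in-memory size and CPU/GPU split
ollama ps

# Unload a model immediately to free memory before loading the next one
ollama stop qwen3:30b
```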
Key finding: The coder fine-tuning isn’t just about coding knowledge - it teaches the model to actually use tools instead of guessing. qwen3-coder explored properly; qwen3 base hallucinated everything.
Sautéed for ~5m
qwen2.5-coder:32b
Failed. Despite having 128K context configured, through multiple attempts it kept reaching for the Explore subagent tool and then abruptly stopping without completing any work. Unlike qwen3-coder which recovered when Explore failed, qwen2.5-coder couldn’t adapt. Same model family, different generation, completely different behavior when things go wrong.
mistral-small3.2:24b
Failed - hallucinated tool parameters. This model understands it should use tools but invents wrong parameter schemas. From the session trace, it tried to call the Task tool with made-up parameters:
// Attempt 1:
{"instruction": "...", "max_depth": 100}
// Attempt 2:
{"subagent_name": "Explore", "subagent_type": "Explore", "subagent_prompt": "..."}
The actual required parameters are description and prompt. When it received clear error messages explaining this, it simply repeated “I’m going to use the Task tool…” and stopped - unable to self-correct.
This is a different failure mode than hallucinating content (qwen3) or refusing (functiongemma). The model has learned about tools but not the actual invocation format. Worth noting: devstral-small-2 is also a Mistral model and works perfectly - the difference is devstral’s agentic specialization.
Memory: 37GB loaded at 128K context (vs 15GB on disk).
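For contrast, a Task call that passes validation looks like the one in the qwen3-coder trace above - a flat object with `description` and `prompt` present (a sketch reconstructed from that trace, not an official schema; `subagent_type` shown here with the built-in general-purpose agent since Explore wasn't configured in my setup):

```json
{
  "description": "Explore codebase structure",
  "prompt": "Explore the structure of this JupyterLab LaTeX extension repository...",
  "subagent_type": "general-purpose"
}
```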
magistral:24b
Failed - narrated tools instead of invoking them. This new Mistral reasoning model understood the task and knew which tools to use, but wrote out tool calls as text instead of actually executing them:
"Let me use the Glob tool to find these patterns:
```bash
Glob pattern: **/README.md
Glob pattern: .github/readme*
...
```
Now that I have the relevant files, let's analyze..."
Zero actual tool calls were made. The model described what it would do, assumed the tools had run, and proceeded to the next step. This suggests training on tool documentation without actual tool-use interactions.
Memory: 23GB loaded at 128K context (vs 14GB on disk).
Native context limitation: magistral’s native context is only 39K. Even with Ollama allocating 128K, the model may not effectively use context beyond its training limit - which could explain why it never received the tool invocation format.
cogito:32b
Failed - memory issues and context-limited stall. This hybrid reasoning model has different failure modes depending on context configuration:
At 128K context: Loaded 64GB into memory (41% CPU / 59% GPU split). On my 48GB system, this caused severe memory thrashing - spiky memory pressure, swap usage, and zero tokens produced after 5+ minutes.
At 64K context: Loaded 42GB (8% CPU / 92% GPU). Still tight but runnable. Same stalling behavior.
At 32K context: Loaded 30GB (100% GPU). Actually started working! Made correct Glob and Read calls, explored the codebase properly:
Glob → Read README.md → "Let me create a todo list..."
But then it just… stopped. Said “Let me start with writing the overview section first” and ended without writing anything. Even nudging with “continue” prompt didn’t help - completely stuck.
This is the same pattern as granite4:32b at 32K context: can explore but can’t complete. 32K context is insufficient for task completion - the model loses track of the goal mid-execution.
cogito:14b
Failed - multiple tool issues. Testing the smaller cogito variant to see if the 7-15B range had any surprises. It did, but not good ones.
Memory: Even at 9GB on disk, loaded 45GB at 128K context with 15% CPU offload. At 64K context it was more manageable.
Tool sequence shows multiple failure modes:
Read README.md ✅ → Read copilot-instructions.md ✅ (not found) →
WebSearch ❌ (hallucinated) → TodoWrite ❌ (wrong params, twice) →
Printed CLAUDE.md as text ⚠️
- Hallucinated `WebSearch` - tool doesn’t exist in Claude Code, got empty results
- Wrong TodoWrite params - missing required `activeForm` field, tried twice without learning
- Never used the Write tool - just printed the CLAUDE.md content as markdown text instead of writing to file
The generated content was actually reasonable - correct commands, accurate architecture. But the model “completed” the task by printing output rather than writing the file. It understood the goal but couldn’t execute properly.
Time: ~7.7 minutes
The cogito family (both 32b and 14b) consistently fails with Claude Code’s tool schemas - different sizes, different failure modes, same outcome.
command-r:35b
Failed - nested tool parameter schema. The last untested model in the viable 15-35B range. At 128K context it didn’t fit on my GPU. At 64K and 32K it loaded but failed with the same tool schema issue.
From the trace, the model wrapped all tool parameters in a nested structure:
{
"tool_name": "Task",
"parameters": {
"description": "...",
"prompt": "...",
"subagent_type": "general-purpose"
}
}
The correct format is flat parameters at the top level. It made 4 tool calls (3 Task, 1 TodoWrite) - all failed with validation errors like “required parameter description is missing” because the nesting caused parameters to be undefined at the expected level.
Unlike mistral-small3.2 which invented wrong parameter names, command-r uses the correct parameter names but wraps them incorrectly. When it received validation errors, it didn’t retry - just output a text-based “Action Plan” and stopped.
This suggests Cohere’s tool-calling format differs from the Anthropic API schema. The model was trained on a different tool invocation structure.
Context comparison:
- 32K: 4 tool calls, all failed, gave up quickly (~7 min)
- 64K: 29 tool calls, all failed, kept retrying same broken schema (~9.5 min)
More context didn’t help - it just gave the model more runway to keep failing the same way. It never learned from the error messages.
Results
✅ Worked
| Model | Quality | Time | Notes |
|---|---|---|---|
| devstral-small-2 ⭐ | Excellent | 17 min | No hallucinations, no interventions |
| qwen3-coder | Good | 24 min | Recovered after Explore failed |
| granite4:32b | Good | ~7 min | Fast but lazy, minor hallucinations* |
⚠️ Completed With Issues
| Model | Quality | Time | Issue |
|---|---|---|---|
| gpt-oss:20b | Low | ~3 min | Needed intervention |
| nemotron-3-nano | Mixed | - | Hallucinated on first attempt |
| qwen3:30b | Poor | ~5 min | Zero tool calls, fabricated everything |
❌ Failed
| Model | Time | Failure Mode |
|---|---|---|
| qwen2.5-coder:32b | - | Stuck on Explore subagent |
| mistral-small3.2:24b | - | Wrong tool parameter schema |
| magistral:24b | - | Narrated tools instead of invoking |
| cogito:32b | - | Memory thrashing, context stall |
| cogito:14b | ~8 min | Hallucinated WebSearch tool |
| command-r:35b | 7-10 min | Nested tool parameters |
| deepseek-r1:32b | - | No tool support in Ollama |
| deepseek-coder-v2:16b | - | No tool support in Ollama |
| functiongemma:270m | - | Refuses everything |
| granite4:3b | - | Hallucinates without tools |
| phi4-mini:3.8b | - | Invents fake tool names |
| rnj-1:8b | - | Silent, zero output |
*granite4:32b referenced files it never verified existed. It “works” in the sense that it completes the task and produces usable output, but you’d want to review it before trusting it. devstral and qwen3-coder are trustworthy out of the box.
Winner: devstral-small-2 - best quality, smallest footprint, zero interventions.
Model Outputs
Compare the actual CLAUDE.md files generated by each model. The full outputs are reproduced below:
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
This is jupyterlab-latex, a JupyterLab extension for live-editing of LaTeX documents. It provides:
- Real-time LaTeX compilation and PDF preview
- SyncTeX support for editor-PDF synchronization
- Writing tools (text formatting, lists, tables, plots)
- Mathematical constants and symbols insertion
The extension consists of:
- Frontend (TypeScript): JupyterLab UI components and plugins
- Backend (Python): Server extension for LaTeX compilation
Development Setup
Prerequisites
- Node.js (for building TypeScript)
- Python 3.8+
- JupyterLab 4.0+
- LaTeX distribution (e.g., TeX Live, MiKTeX) with `pdflatex`, `xelatex`, or `tectonic`
Installation
# Development install
pip install -e .
# Link with JupyterLab
jupyter labextension develop . --overwrite
jupyter server extension enable jupyterlab_latex
# Build TypeScript
jlpm run build
Building
- `jlpm run build`: Build TypeScript and labextension
- `jlpm run build:prod`: Production build (clean + minified)
- `jlpm run watch`: Watch mode for development
- `jupyter labextension build .`: Build labextension
Linting
- `jlpm run lint`: Run ESLint, Prettier, and Stylelint
- `jlpm run eslint`: Run ESLint only
- `jlpm run prettier`: Run Prettier formatting
- `jlpm run stylelint`: Run Stylelint
Cleaning
- `jlpm run clean`: Clean build artifacts
- `jlpm run clean:all`: Clean everything (lib, labextension, lint cache)
Architecture
Frontend Structure (TypeScript)
Location: `src/`
- `index.ts`: Main plugin entry point
  - Defines JupyterFrontEndPlugin for LaTeX extension
  - Sets up commands, toolbar buttons, and context menu items
  - Manages PDF preview and SyncTeX synchronization
- `pdf.ts`: PDF viewer components
  - `PDFJSViewer`: PDF rendering widget
  - `PDFJSViewerFactory`: Document registry factory
  - `PDFJSDocumentWidget`: Widget wrapper
- `error.tsx`: Error panel component for LaTeX compilation errors
- `pagenumber.tsx`: Page number display component
Backend Structure (Python)
Location: `jupyterlab_latex/`
- `__init__.py`: Server extension entry point
  - Registers `/latex/build` and `/latex/synctex` handlers
  - Sets up Tornado web handlers
- `build.py`: LaTeX compilation handler
  - `LatexBuildHandler`: Main API handler for compilation
  - `latex_cleanup()`: Context manager for file cleanup
  - `build_tex_cmd_sequence()`: Builds LaTeX command sequences
  - `filter_output()`: Filters LaTeX warnings/errors
  - `run_latex()`: Executes LaTeX commands
- `synctex.py`: SyncTeX handler for editor-PDF synchronization
- `config.py`: Configuration schema
  - `LatexConfig`: Configuration class with settings for:
    - `latex_command`: LaTeX engine (xelatex, pdflatex, tectonic)
    - `bib_command`: BibTeX command
    - `run_times`: Number of compilation passes
    - `shell_escape`: Security setting (restricted/allow/disallow)
    - `cleanup`: Whether to clean intermediate files
    - `manual_cmd_args`: Custom command arguments
    - `disable_bibtex`: Skip BibTeX compilation
- `util.py`: Utility functions
  - `run_command()`: Execute shell commands with async support
Key Features
- Live Preview: Opens PDF alongside .tex editor, auto-updates on save
- SyncTeX: Click in PDF to jump to source, click in source to jump to PDF
- Toolbar: Adds buttons for formatting (bold, italic), sub/superscripts, lists, tables, plots
- Menu: Constants (π, φ) and symbols (≤, ≥, ∈) insertion
- Error Handling: Shows compilation errors in a dedicated panel
Communication Flow
- User saves .tex file → `fileChanged` signal emitted
- Frontend calls `/latex/build` API endpoint
- Backend runs LaTeX compilation
- PDF is generated/updated → displayed in PDF viewer
- SyncTeX data generated for navigation
Configuration
Users can configure the extension via jupyter_notebook_config.py:
# Use pdflatex instead of xelatex
c.LatexConfig.latex_command = 'pdflatex'
# Custom compilation command
c.LatexConfig.manual_cmd_args = [
'lualatex',
'-interaction=nonstopmode',
'-shell-escape',
'{filename}.tex'
]
# Disable BibTeX compilation
c.LatexConfig.disable_bibtex = True
# Clean intermediate files
c.LatexConfig.cleanup = True
Testing Approach
The project uses:
- TypeScript with ESLint for frontend
- Python for backend
- No dedicated test suite found in repository
Important Files
- `package.json`: Frontend dependencies and scripts
- `pyproject.toml`: Python package configuration
- `schema/plugin.json`: JupyterLab settings schema
- `README.md`: User documentation
- `CHANGELOG.md`: Release notes
Common Development Tasks
Adding a new toolbar button:
- Create button in `EditorToolbarPanel` class in `src/index.ts`
- Add icon SVG in `style/icons/`
- Register command with `app.commands.addCommand()`

Adding a new menu item:
- Add to `addLatexMenu()` function
- Create command handler

Modifying LaTeX compilation:
- Update `build_tex_cmd_sequence()` in `jupyterlab_latex/build.py`
- Ensure SyncTeX flags are preserved

Adding new symbols/constants:
- Extend the `constants` or `symbols` maps in `addLatexMenu()`
Notes
- The extension uses PDF.js for rendering PDFs in the browser
- SyncTeX requires `.synctex.gz` files generated by LaTeX
- File cleanup is optional and controlled by `cleanup` config
- Manual command arguments support `{filename}` placeholder
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Overview
This is the JupyterLab LaTeX extension, which provides live editing of LaTeX documents within JupyterLab. It allows users to compile and preview LaTeX documents in real-time, with features like SyncTeX support for bidirectional navigation between source and preview.
Key Architecture Components
Frontend (TypeScript/JavaScript)
- The extension is built using JupyterLab's extension system
- Core frontend functionality is in `src/index.ts`
- Implements a PDF viewer using pdfjs-dist for displaying compiled LaTeX output
- Provides toolbar buttons and context menu items for LaTeX editing tools
- Implements SyncTeX functionality for bidirectional navigation between editor and PDF
Backend (Python)
- Server extension in `jupyterlab_latex/` directory
- Handles LaTeX compilation through the Jupyter Server API
- Uses `tornado.process.Subprocess` for executing LaTeX commands
- Provides API endpoints at `/latex/build` and `/latex/synctex`
- Configuration is handled through `jupyterlab_latex/config.py` using traitlets
Build System
- Uses TypeScript for frontend with `tsc` compiler
- Uses JupyterLab's build system with `@jupyterlab/builder`
- Uses yarn for package management
- Builds both frontend and backend extensions
Development Setup
To develop this extension, you need:
- NodeJS for building the frontend
- Python 3.8+ for the server extension
- JupyterLab 4.0+ for running the development environment
Development commands:
- `jlpm install` - Install dependencies
- `jlpm build` - Build the extension
- `jlpm watch` - Watch for changes and rebuild automatically
- `jupyter labextension develop . --overwrite` - Link development version
- `jupyter server extension enable jupyterlab_latex` - Enable server extension
Key Files and Directories
- `src/index.ts` - Main extension entry point
- `jupyterlab_latex/__init__.py` - Extension initialization
- `jupyterlab_latex/build.py` - LaTeX compilation handler
- `jupyterlab_latex/config.py` - Configuration handling
- `jupyterlab_latex/util.py` - Utility functions for running commands
- `package.json` - Extension metadata and build scripts
- `pyproject.toml` - Python package configuration
- `README.md` - Documentation and usage instructions
Key Features
- Real-time LaTeX compilation and preview
- SyncTeX support for bidirectional navigation between source and PDF
- Toolbar buttons for common LaTeX editing tasks (subscript, superscript, lists, etc.)
- Context menu integration for preview and SyncTeX actions
- Support for multiple LaTeX engines (xelatex, pdflatex, tectonic)
- Configuration options for customizing compilation commands
- BibTeX support for bibliography compilation
- Error handling and display of compilation errors
Testing
The extension can be tested by:
- Installing in development mode
- Opening a `.tex` file in JupyterLab
- Using the preview toolbar button or context menu
- Verifying that LaTeX compilation works and PDF is generated
- Testing SyncTeX functionality by clicking in PDF and editor
Common Development Tasks
- Adding new toolbar buttons or menu items
- Modifying LaTeX compilation commands or options
- Enhancing error handling or display
- Adding new LaTeX editing features
- Improving SyncTeX integration
- Customizing the extension's appearance or behavior
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Common Development Commands
- Build: `jlpm run build` - Compile the TypeScript source and generate JS bundles.
- Watch Build: `jlpm run watch` - Continuously rebuild on file changes while you develop.
- Run JupyterLab (with extension enabled): `jupyter lab` - Launch a local JupyterLab instance that loads this extension. The server extension must be installed with `pip install -e .` or `conda install -c conda-forge jupyterlab-latex` beforehand.
- Lint: No dedicated lint step; TypeScript compilation (`jlpm run build`) will surface syntax errors. Ensure you have ESLint/TSLint configured in your editor if desired.
- Run Tests (if any): The repository does not expose a test runner command. If unit tests are added later, they would typically be executed via `pytest` or the JupyterLab testing framework (`jlpm run test`).
- Format Code: Use Prettier/TypeScript formatter configured in the project; simply save files after opening them to auto-format.
High‑Level Architecture Overview
flowchart TD
subgraph Frontend (LabExtension)
A[LaTeX UI Components] -->|Provides toolbar, dialogs, preview panel|
B[LitElement / React components] --> C[Preview iframe]
end
subgraph Backend (Server Extension)
D[Python entrypoint: jupyterlab_latex]
E[LatexConfig] -->|Customizes LaTeX command, shell escape|
F[Bibtex Helper] -->|Runs bibtex if .bib files exist|
G[Compile Runner] -->|Executes latex_command with arguments|
end
A -->|Sends compile request to| D
D --> E
D --> F
D --> G
- LabExtension (frontend): Provides the UI for LaTeX preview, toolbar buttons (subscript/superscript/bold/etc.), table creation dialog, and plot insertion. It registers a command `latex:showPreview` that triggers compilation.
- Server Extension (backend): Implements the core logic:
  - LatexConfig holds configuration values such as `latex_command`, `run_times`, `disable_bibtex`, etc., which can be overridden via JupyterLab's config system.
  - When a compile request arrives, it builds an argument list (default: `[latex_command, '-interaction=nonstopmode', '-halt-on-error', ..., '{filename}.tex']`).
  - It runs the LaTeX command in a subprocess to produce `*.pdf`. If `.bib` files are present and bibtex is enabled, it runs `bibtex` (or a custom command) before recompiling.
- Configuration: Customization via Jupyter config (`jupyter_notebook_config.py`):
  - `c.LatexConfig.latex_command = 'pdflatex'` # or 'xelatex', 'lualatex'
  - `c.LatexConfig.run_times = 2` # multi-pass for refs
  - `c.LatexConfig.disable_bibtex = False` # enable bibtex by default
- Security: The extension respects LaTeX's shell-escape policy (`c.LatexConfig.shell_escape`). By default it is `restricted`; you can set it to `allow` if needed.
- Integration Points:
  - Toolbar: Buttons call UI actions that emit events handled by the backend via Jupyter messages.
  - Commands: Registered with Lab's command palette (`latex:showPreview`, etc.).
  - Mime Renderers: The preview panel renders PDF output using an iframe or PDF.js viewer embedded in the frontend.
Tips for Development
- Start a development server: `conda env create -f environment.yml && conda activate jupyterlab-latex-env` (or use pip/conda as described). Then run `jlpm install` followed by `jlpm run watch` and open another terminal with `jupyter lab --watch`. Any change in TypeScript will trigger a rebuild automatically.
- Testing: Currently no test suite is shipped; add unit tests under `tests/` using pytest if needed, e.g., `pytest -q` to run them.
- Debugging LaTeX Errors: When compilation fails, the server writes logs to `<notebook_dir>/latex.log`. The frontend shows these in an error panel for user visibility.
- Adding New Features:
  - For additional toolbar icons or plot types, extend `src/components/Toolbar.tsx` and implement corresponding backend handlers that generate LaTeX code snippets.
  - To support a new compilation engine (e.g., `tectonic`), modify `LatexConfig.manual_cmd_args` via config or expose an option UI in the frontend.
- Version Compatibility: Ensure you are using JupyterLab ≥4.x and NodeJS 18+ to avoid breaking changes introduced after this extension's last release.
Resources & References (from README)
- BibTeX handling – automatic if `.bib` files exist, configurable via `disable_bibtex` flag.
- Multiple compile passes – set `run_times = 2` for resolving references like `\ref{}`.
- Custom compilation command – use `manual_cmd_args` to override default LaTeX invocation.
- Security settings – control shell escape behavior with `shell_escape`.
This CLAUDE.md summarizes essential commands and architectural knowledge needed to be productive when working on the jupyterlab-latex repository.
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Common Development Commands
Build the extension: `jlpm build:lib && jlpm build:labextension`

For production build: `jlpm clean && jlpm build:prod`

Linting and formatting:
- `jlpm run lint:check` # Run eslint and stylelint checks
- `jlpm prettier` # Run prettier to format files
- `jlpm fix` # Auto-fix linting issues

Watch mode for development: `jlpm watch` # Watch src and rebuild on changes

Install the extension in development mode (after cloning):
- `pip install -e .`
- `jupyter labextension develop . --overwrite`
- `jupyter server extension enable jupyterlab_latex`
Architecture Overview
The LaTeX extension consists of two primary components:
1. LaTeX Front-end Plugin (`latexPlugin`) – registers commands, toolbar buttons, and context-menu items for creating new `.tex` files, opening live previews, and handling SyncTeX mapping between editor and PDF.
2. PDFJS Renderer (`pdfjsPlugin`) – provides a viewer widget for rendering PDF files using PDF.js, tracks PDF widgets, and integrates with the notebook file browser.

Key concepts:
- Uses JupyterFrontEnd's plugin system to add commands like `latex:open-preview`, `latex:synctex-edit`, and `latex:synctex-view`.
- Leverages `WidgetTracker` and `IPDFJSTracker` for managing PDF widget lifecycle.
- SyncTeX integration enables forward (editor → PDF) and reverse (PDF → editor) navigation via `synctexEditRequest` and `synctexViewRequest` functions.
- Toolbar extensions add formatting shortcuts (subscript, superscript, fraction, alignment, list/tree generation, plot insertion) that operate on the current editor selection.
Configuration can be customized through jupyter_notebook_config.py, e.g., changing the LaTeX compilation command or disabling SyncTeX.
Testing & Verification
- The project uses a CI pipeline defined in `.github/workflows/build.yml`. Steps include:
  - Installing dependencies (`python -m pip install .[test]`)
  - Running lint checks (`jlpm run lint:check`)
  - Building the extension and verifying server/labextension listings.
  - Running `python -m jupyterlab.browser_check` for browser compatibility validation.
- There is no dedicated unit‑test command; verification relies on manual preview testing in JupyterLab and automated CI checks.
Customization Points
- Compilation Command – modify via `c.LatexConfig.manual_cmd_args` or `c.LatexConfig.synctex_command`.
- Shell Escape Settings – control with `c.LatexConfig.shell_escape`.
These sections give a concise map of typical development tasks and the overall extension architecture for anyone (including future Claude instances) who needs to work with this repository.
Failure Modes
Testing revealed distinct ways models fail at agentic tasks:
| Failure Mode | Example | Probable Cause |
|---|---|---|
| Refuses | functiongemma | Too conservative, confused by system prompts |
| Hallucinates content | qwen3:30b, granite4:3b | Skips tools, fabricates output |
| Hallucinates tools | phi4-mini | Invents non-existent tool names |
| Hallucinates params | mistral-small3.2 | Knows tools exist, wrong schema |
| Narrates tools | magistral | Describes tools in text, never invokes |
| Stuck on subagent | qwen2.5-coder | Can’t adapt when Explore fails |
| Context stall | cogito:32b, granite4@32K | Explores correctly, stops mid-task |
| Nested params | command-r | Wraps params in {“tool_name”:X,“parameters”:{…}} |
| Silent | rnj-1:8b | Zero output, can’t process system prompts |
The more sophisticated failures (wrong params, narration, nested params) suggest models trained on different tool-calling formats or documentation rather than actual Anthropic API interactions. Native context window also matters - magistral (39K native) failed even with 128K allocated.
How Local Models Compare to Cloud
SWE-bench Verified is what everyone uses to evaluate agentic coding - 500 real GitHub issues that models must solve. Here’s how local models compare to cloud:
Frontier Cloud Models (Proprietary)
| Model | SWE-bench |
|---|---|
| Gemini 3 Flash | 75-76% |
| Claude Opus 4.5 | 74-81% |
| GPT-5.2 | 72-75% |
| Claude Sonnet 4.5 | 70.6% |
| Claude Haiku 4.5 | 68.8% |
Large Open Weights (Won’t fit 48GB)
| Model | SWE-bench | Size |
|---|---|---|
| Devstral 2 | 72.2% | 123B |
| Qwen3-Coder-480B | 67% | 480B |
| DeepSeek-V3.1 | 66% | 671B |
Local Models (Fits 48GB)
| Model | SWE-bench | Result |
|---|---|---|
| devstral-small-2 | 68.0% | ⭐ Winner |
| qwen3-coder:30b | 51.6% | ✅ Good |
| deepseek-r1:32b | 41.4% | ❌ No tools |
| qwen2.5-coder:32b | 9.0% | ❌ Stuck |
The gap is surprisingly small. devstral-small-2 at 68% matches Claude Haiku 4.5 and trails Opus by only 6-8 points. A 24B model running locally keeps up with 100B+ models - turns out agentic training matters more than size.
SWE-bench score also predicts Claude Code success: models without published scores aren’t coding-focused and failed my tests.
Conclusions
Local models can do real agentic work now. devstral-small-2 completed the task reliably, with no hand-holding. It’s slower than cloud (17 min vs 2 min), but it runs on my laptop completely offline.
Key Takeaways
- devstral-small-2 wins - best results, smallest footprint, built for this
- The gap is smaller than I expected - 68% SWE-bench matches Haiku, trails Opus by 8 points
- Context window matters - Ollama defaults to 4K; bump it to 64K or watch models hallucinate
- SWE-bench predicts success - no published score usually means it won’t work
- Speed hurts - 17-24 minutes vs 2 minutes on cloud
- Check tool support first - not all models work with Ollama’s Anthropic API
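On that last point, a quick pre-flight check is possible with a recent Ollama CLI: `ollama show` prints a model's capabilities and native context length, so you can rule out models without tool support before spending 20 minutes on a run.

```bash
# Look for "tools" under Capabilities and the context length under Model
ollama show devstral-small-2:24b

# Same check on magistral, whose native context (~39K, per the section above) is the limiting factor
ollama show magistral:24b
```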
What Works
devstral-small-2 and qwen3-coder both work reliably. The tool calling infrastructure is solid when the model supports it. Ollama 0.14.0 makes setup easy - no more LiteLLM translation layer.
What Doesn’t Work (Yet)
Most models can’t finish multi-step agentic tasks without help. Context overflow causes hallucinations (fabricated URLs, wrong repo names). And 8-12x slower than cloud is hard to ignore.
Critical: Set Context to 64K+
Ollama defaults to 4K context regardless of what model cards advertise. Claude Code’s system prompts overflow this, causing silent failures or hallucinations.

| Context | Result |
|---|---|
| 4-16K | ❌ Zero tool calls |
| 32K | ⚠️ Starts fine, then hallucinates |
| 64K+ | ✅ Works |
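I set the context through the GUI slider, but if you're scripting the setup, two equivalent routes exist (sketches using standard Ollama mechanisms as I understand the docs; adjust 65536 to whatever your RAM tolerates):

```bash
# Option A: set a server-wide default context length (applies to every model Ollama loads)
OLLAMA_CONTEXT_LENGTH=65536 ollama serve

# Option B: bake a larger context into a named model variant via a Modelfile
cat > Modelfile <<'EOF'
FROM devstral-small-2:24b
PARAMETER num_ctx 65536
EOF
ollama create devstral-small-2-64k -f Modelfile
```

Option B is handy when you want different context sizes per model without touching the server configuration.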
Quick Start
# 1. Install Ollama 0.14.0+ and pull devstral
ollama pull devstral-small-2
# 2. Set context to 64K in Ollama settings (GUI slider)
# 3. Add alias to ~/.zshrc
alias claude-local='ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY=ollama CLAUDE_CODE_USE_BEDROCK=0 claude --model devstral-small-2'
# 4. Run it
source ~/.zshrc
claude-local