Converted research paper to Kindle EPUB

Summary

Closed, Completed

Understand
Classified source PDF and inspected intermediate Markdown.
Build
Created and improved a custom build script with state-tracking and table handling.
Built the EPUB file.
Verify
Performed deterministic QA checks and readability spot-checks.
Verified EPUB TOC structure and heading correctness.

Convert this research paper so I can read it on a Kindle Paperwhite (7th gen): /Users/kiran/Downloads/DeepSeek_V4.pdf

>ls /Users/kiran/Downloads/pdf2kindle/

__pycache__
_tmp_epub_inspect
_tmp_fp_fetch
a-close-call-terence-tao.epub
a-close-call-terence-tao.md
AGENTS.md
ASER_2024_Final-Report_13_2_24-1.pdf
aws-builders-library.epub
build_hybrid_markdown.py
compile_fp_collection.py
CONVERSION_QA_CHECKLIST.md
event-based-concurrency-1.epub
extract.py
first-principles-socratic-systems-collection.epub
first-principles-socratic-systems-collection.md
kindle.css
License-and-Permits-Brief-v2.epub
License-and-Permits-Brief.epub
on-keeping-a-notebook.epub
pdf_sources
pdf2kindle.sh
pyproject.toml
qa_epub.py
qa_shared.py
README.md
semaphores-kindle.epub
turtles-termites-traffic-jams.epub
uv.lock

>ls -lh /Users/kiran/Downloads/DeepSeek_V4.pdf && pdfinfo /Users/kiran/Downloads/DeepSeek_V4.pdf 2>/dev/null | head -30

-rw-r--r--@ 1 kiran  staff   4.3M Apr 29 17:21 /Users/kiran/Downloads/DeepSeek_V4.pdf
Title:           
Subject:         
Keywords:        
Author:          
Creator:         LaTeX with hyperref
Producer:        pdfTeX-1.40.27
CreationDate:    Fri Apr 24 15:27:37 2026 IST
ModDate:         Fri Apr 24 15:27:37 2026 IST
Custom Metadata: yes
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           58
Encrypted:       no
Page size:       595.276 x 841.89 pts (A4)
Page rot:        0
File size:       4479907 bytes
Optimized:       yes
PDF version:     1.5

>cd /Users/kiran/Downloads/pdf2kindle && pdftotext -l 2 /Users/kiran/Downloads/DeepSeek_V4.pdf - | head -80

DeepSeek-V4:
Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-AI
[REDACTED]

Abstract
We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and
DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of
one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA)
and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3)
and the Muon optimizer for faster convergence and greater training stability. We pre-train
both models on more than 32T diverse and high-quality tokens, followed by a comprehensive
post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for
open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are
highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared
with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby
making long-horizon tasks and further test-time scaling more feasible. The model checkpoints
are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

[Figure residue from the page 1 plot. Legend: DeepSeek-V4-Pro-Max, Claude-Opus-4.6-Max, GPT-5.4-xHigh, Gemini-3.1-Pro-High. Axis label: Accuracy / Pass@1 (%). Scattered numeric labels omitted.]

Source classified as: text-based PDF (LaTeX-generated, 58 pages, single column). Running first pass with --keep-md and --no-pause so I can inspect the intermediate Markdown and rebuild after structural cleanup.
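
The classification step can be sketched in Python. This is only an illustration: the function name, the 200-character threshold, and the 80% ratio are assumptions, not values taken from pdf2kindle.sh or extract.py. It expects one string per page, which can be obtained by splitting pdftotext output on the form-feed character it writes at page breaks.

```python
def classify_pdf_text(pages, min_chars_per_page=200, min_ratio=0.8):
    """Guess whether a PDF has a real text layer.

    pages: list of strings, one per page of extracted text.
    Thresholds are hypothetical defaults chosen for illustration.
    """
    if not pages:
        return "scanned"
    # Count non-whitespace characters on each page.
    counts = [sum(1 for ch in page if not ch.isspace()) for page in pages]
    # A page counts as "texty" when it carries enough extracted characters.
    texty = sum(1 for n in counts if n >= min_chars_per_page)
    return "text-based" if texty / len(pages) >= min_ratio else "scanned"
```

Usage would look like `classify_pdf_text(raw.split("\f"))`, where `raw` is the full pdftotext output. A trailing empty page from the final form feed lowers the ratio slightly, so it may be worth dropping before the call.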

>cd /Users/kiran/Downloads/pdf2kindle && ./pdf2kindle.sh --no-pause --keep-md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" /Users/kiran/Downloads/DeepSeek_V4.pdf deepseek-v4.epub 2>&1 | tail -80

- Suggested fix: Ensure internal fragment links point to real ids in the target document.
34. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #swa-kv in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
35. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #swa-kv-1 in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
36. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #swa-kv-2 in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
37. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #swa-kv-3 in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
38. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #csa-kv in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
39. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #csa-kv-1 in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
40. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #shots in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
41. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #bbh-em in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
42. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #tools in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
43. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #mrcr-1m-mmr in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
44. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #mrcr-1mmmr in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
45. [Navigation & TOC Validation] Heading anchors/targets exist where needed
   - Evidence: Missing fragment target #none in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
   - Impact: navigation
   - Suggested fix: Ensure internal fragment links point to real ids in the target document.
46. [Technical EPUB Validity] EPUB opens cleanly without archive/format errors
   - Evidence: Unable to parse XHTML content EPUB/text/ch001.xhtml: not well-formed (invalid token): line 316, column 3
   - Impact: technical
   - Suggested fix: Ensure content documents are well-formed XHTML.
47. [Technical EPUB Validity] EPUB opens cleanly without archive/format errors
   - Evidence: Unable to parse XHTML content EPUB/text/ch002.xhtml: mismatched tag: line 54, column 46
   - Impact: technical
   - Suggested fix: Ensure content documents are well-formed XHTML.
48. [Artifact Cleanup Sweep] Wrapped URLs and broken inline text repaired
   - Evidence: Split URL detected in source markdown: DeepSeek_V4.md
   - Impact: readability
   - Suggested fix: Normalize wrapped URLs before final build.
49. [Artifact Cleanup Sweep] OCR artifacts spot-cleaned (if OCR used)
   - Evidence: Control characters detected in source markdown: DeepSeek_V4.md
   - Impact: readability
   - Suggested fix: Strip control characters before EPUB generation.
50. [Artifact Cleanup Sweep] Accidental line joins/splits fixed
   - Evidence: Split hyphenated words detected in source markdown: deci- summary, reason- prompt
   - Impact: readability
   - Suggested fix: De-hyphenate wrapped words before final build.
51. [Artifact Cleanup Sweep] Repeated page headers/footers/page numbers removed
   - Evidence: Suspicious repeated lines in source markdown: 'Total (总计)' x4; 'Subtotal (小计)' x8
   - Impact: readability
   - Suggested fix: Inspect for leaked headers/footers or other repeated boilerplate and remove them before final build.
52. [Artifact Cleanup Sweep] Paragraph flow is smooth for e-ink reading
   - Evidence: Very long paragraph detected in DeepSeek_V4.md: By employing hybrid CSA and HCA, along with precision optimizations on computation and storage, DeepSeek-V4 series achieve significantly low…
   - Impact: readability
   - Suggested fix: Split dense paragraphs into smaller Kindle-friendly chunks before final build.

    WARNING: Validation found issues (see above). Review before sending to Kindle.
  Markdown file kept: DeepSeek_V4.md
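
The "Missing fragment target" failures above come down to one check: every `#fragment` linked from nav.xhtml must exist as an `id` in the target document. A minimal sketch of that check follows, assuming the XHTML files have already been read into strings. The function name and the regex-based parsing are assumptions for illustration, not how qa_epub.py works.

```python
import re

def missing_fragment_targets(nav_xhtml, documents):
    """Return (file name, fragment) pairs linked from the nav but not defined.

    documents maps a content file name (e.g. "ch001.xhtml") to its XHTML text.
    Regex parsing is a simplification; it also works on files that are not
    well-formed, which a strict XML parser would reject.
    """
    missing = []
    for path, fragment in re.findall(r'href="([^"#]*)#([^"]+)"', nav_xhtml):
        name = path.rsplit("/", 1)[-1]
        body = documents.get(name, "")
        # Collect id="..." attributes, skipping names like data-id.
        ids = set(re.findall(r'(?<![\w-])id="([^"]+)"', body))
        if fragment not in ids:
            missing.append((name, fragment))
    return missing
```

For example, a nav entry pointing at `text/ch001.xhtml#swa-kv` is reported when ch001.xhtml has no element with `id="swa-kv"`, which matches the shape of the evidence lines in the QA report.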

>cd /Users/kiran/Downloads/pdf2kindle && ./pdf2kindle.sh --no-pause --keep-md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" /Users/kiran/Downloads/DeepSeek_V4.pdf deepseek-v4.epub 2>&1 | head -120

==> Step 1: Extracting text from PDF...
Extracting text from: /Users/kiran/Downloads/DeepSeek_V4.pdf
Applying cleanup heuristics...

Markdown written to: DeepSeek_V4.md
  2352 lines, 59 pages processed
  15 headings detected, 0 header/footer patterns stripped, 508 page numbers removed, 0 TOC lines removed

==> Step 3: Converting Markdown to EPUB...
[WARNING] Could not convert TeX math TOOL_NAME"> <|DSML|parameter name=", rendering as TeX:
  TOOL_NAME"> <|DSML|parameter name="
           ^
  unexpected '"'
  expecting "\\bangle", "\\brace", "\\brack", "\\choose", "\\displaystyle", "\\textstyle", "\\scriptstyle", "\\scriptscriptstyle", "{", "\\operatorname", letter, digit, ".", "!", "'", "''", "'''", "''''", "*", "+", ",", "-", ".", "/", ":", ":=", ";", "<", "=", ">", "?", "@", "~", "_", "^", "'", "\\left", "\\lVert", "\\", "\\hyperref" or end of input
[WARNING] Could not convert TeX math PARAMETER_VALUE </|DSML|parameter> ...
  </|DSML|invoke> <|DSML|invoke name=", rendering as TeX:
  <|DSML|invoke name="
                     ^
  unexpected '"'
  expecting "\\bangle", "\\brace", "\\brack", "\\choose", "\\displaystyle", "\\textstyle", "\\scriptstyle", "\\scriptscriptstyle", "{", "\\operatorname", letter, digit, ".", "!", "'", "''", "'''", "''''", "*", "+", ",", "-", ".", "/", ":", ":=", ";", "<", "=", ">", "?", "@", "~", "_", "^", "'", "\\left", "\\lVert", "\\", "\\hyperref" or end of input

Done! EPUB written to: deepseek-v4.epub
  Transfer to Kindle via USB or Send to Kindle.

==> Step 4: Validating EPUB...
warning: The `tool.uv.dev-dependencies` field (used in `pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Conversion QA - Failed Items
1. [Technical EPUB Validity] EPUBCheck: RSC-005
   - Evidence: [ERROR] Error while parsing file: Anchors within nav elements must contain text — deepseek-v4.epub/EPUB/nav.xhtml:13:1673
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
2. [Technical EPUB Validity] EPUBCheck: RSC-005
   - Evidence: [ERROR] Error while parsing file: Anchors within nav elements must contain text — deepseek-v4.epub/EPUB/nav.xhtml:13:1914
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
3. [Technical EPUB Validity] EPUBCheck: RSC-005
   - Evidence: [ERROR] Error while parsing file: element "think" not allowed here; expected the element end-tag, text, element "a", "abbr", "address", "area", "article", "aside", "audio", "b", "bdi", "bdo", "blockquote", "br", "button", "canvas", "cite", "code", "data", "datalist", "del", "details", "dfn", "dialog", "div", "dl", "em", "embed", "epub:switch", "epub:trigger", "fieldset", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "i", "iframe", "img", "input", "ins", "kbd", "label", "link", "main", "map", "mark", "menu", "meta", "meter", "nav", "ns1:math", "ns2:svg", "object", "ol", "output", "p", "picture", "pre", "progress", "q", "ruby", "s", "samp", "script", "section", "select", "slot", "small", "span", "strong", "sub", "sup", "table", "template", "textarea", "time", "u", "ul", "var", "video" or "wbr" (with xmlns:ns1="http://www.w3.org/1998/Math/MathML" xmlns:ns2="http://www.w3.org/2000/svg") or an element from another namespace — deepseek-v4.epub/EPUB/text/ch002.xhtml:47:1584
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
4. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:727
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
5. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:801
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
6. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:870
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
7. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:939
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
8. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1009
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
9. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1077
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
10. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1147
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
11. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1371
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
12. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1443
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
13. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1522
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
14. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1594
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
15. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:52:54
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
16. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:58:50
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
17. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:64:52
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
18. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:70:52
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
19. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:76:52
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
20. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:82:50
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
21. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:88:52
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
22. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:106:49
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
23. [Technical EPUB Validity] EPUBCheck: RSC-012
   - Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:112:55
   - Impact: technical
   - Suggested fix: Refer to the EPUB specification for details.
24. [Technical EPUB Validity] EPUBCheck: RSC-012

>cd /Users/kiran/Downloads/pdf2kindle && wc -l DeepSeek_V4.md && rg -n "^#" DeepSeek_V4.md | head -80

2351 DeepSeek_V4.md
128:## Architecture
417:## Csa / Hca
585:## 𝑍 = 𝐻 ·𝑊 ,
593:## 𝑍 = 𝐻 ·𝑊 ,
725:## 𝑍 = 𝐻 ·𝑊 ,
944:## Swa Kv
950:## Swa Kv
960:## Swa Kv
980:## Swa Kv
992:## Csa Kv
1011:## Csa Kv
1066:# Shots
1098:## Bbh (Em)
1194:## Tools
1201:### Available Tool Schemas
1348:## Mrcr 1M (Mmr)
1430:## Mrcr 1M(Mmr)
1606:## None
1977:#
2035:# V4 win V3.2 win
2109:# DS win Gem win Tie
2265:#
2331:# DS Opus Tie
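
Most of the detected headings above are extraction noise: formula fragments, repeated table labels such as "Swa Kv", and bare `#` lines. A rough way to flag them automatically is sketched below. The function name and the three heuristics (empty text, no run of two ASCII letters, duplicate text) are assumptions for illustration and will not catch every false heading, for example "None" or "Tools".

```python
import re
from collections import Counter

HEADING = re.compile(r"^(#{1,6})\s*(.*)$")

def suspicious_headings(markdown):
    """Return (line number, heading text, reason) for likely false headings."""
    found = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = HEADING.match(line)
        if match:
            found.append((lineno, match.group(2).strip()))
    counts = Counter(text for _, text in found)
    flagged = []
    for lineno, text in found:
        if not text:
            flagged.append((lineno, text, "empty"))
        elif not re.search(r"[A-Za-z]{2,}", text):
            # Math-italic symbols and digits do not count as words.
            flagged.append((lineno, text, "no words"))
        elif counts[text] > 1:
            flagged.append((lineno, text, "duplicate"))
    return flagged
```

Duplicate heading text is also what produces suffixed anchors such as `#swa-kv-1` and `#swa-kv-2`, so clearing these headings should reduce the fragment errors as well.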

>cd /Users/kiran/Downloads/pdf2kindle && pdftotext -f 3 -l 4 /Users/kiran/Downloads/DeepSeek_V4.pdf -

5 Post-Training
  5.1 Post-Training Pipeline
    5.1.1 Specialist Training
    5.1.2 On-Policy Distillation
  5.2 RL and OPD Infrastructures
    5.2.1 FP4 Quantization Integration
    5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
    5.2.3 Preemptible and Fault-Tolerant Rollout Service
    5.2.4 Scaling RL Framework for Million-Token Context
    5.2.5 Sandbox Infrastructure for Agentic AI
  5.3 Standard Benchmark Evaluation
    5.3.1 Evaluation Setup
    5.3.2 Evaluation Results
  5.4 Performance on Real-World Tasks
    5.4.1 Chinese Writing
    5.4.2 Search
    5.4.3 White-Collar Task
    5.4.4 Code Agent
6 Conclusion, Limitations, and Future Directions
A Author List and Acknowledgment
  A.1 Author List
  A.2 Acknowledgment
B Evaluation Details

1. Introduction
The emergence of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024c) has established a
new paradigm of test-time scaling, driving substantial performance gains for Large Language
Models (LLMs). However, this scaling paradigm is fundamentally constrained by the quadratic
computational complexity of the vanilla attention mechanism (Vaswani et al., 2017), which
creates a prohibitive bottleneck for ultra-long contexts and reasoning processes. Concurrently,
the emergence of long-horizon scenarios and tasks — from complex agentic workflows to
massive cross-document analysis — has also made efficient support for ultra-long contexts
critical for future progress. While recent open-source efforts (Bai et al., 2025a; DeepSeek-AI,
2024; MiniMax, 2025; Qwen, 2025) have advanced general capabilities, this core architectural
inefficiency in handling ultra-long sequences remains a key impediment, limiting further gains
from test-time scaling and hindering further exploration into long-horizon scenarios and tasks.
In order to break the efficiency barrier in ultra-long contexts, we develop the DeepSeek-V4
series, including the preview versions of DeepSeek-V4-Pro with 1.6T parameters (49B activated)
and DeepSeek-V4-Flash with 284B parameters (13B activated). Through architectural innovations, DeepSeek-V4 series achieve a dramatic leap in computational efficiency for processing
ultra-long sequences. This breakthrough enables efficient support for a context length of one
million tokens, ushering in a new era of million-length contexts for next-generation LLMs. We
believe our capability to efficiently handle ultra-long sequences unlocks the next frontier of
test-time scaling, paves the way for deeper research into long-horizon tasks, and establishes a
necessary foundation for exploring future paradigms like online learning.
Compared with the DeepSeek-V3 architecture (DeepSeek-AI, 2024), DeepSeek-V4 series
retain the DeepSeekMoE framework (Dai et al., 2024) and Multi-Token Prediction (MTP) strategy,
while introducing several key innovations in architecture and optimization. To enhance long-context efficiency, we design a hybrid attention mechanism combining Compressed Sparse
Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses the KV caches
along the sequence dimension and then performs DeepSeek Sparse Attention (DSA) (DeepSeek-AI, 2025), whereas HCA applies more aggressive compression to the KV caches but keeps
dense attention. To strengthen modeling capability, we incorporate Manifold-Constrained
Hyper-Connections (mHC) (Xie et al., 2026) that upgrade conventional residual connections.
Additionally, we introduce the Muon (Jordan et al., 2024; Liu et al., 2025) optimizer to the
training of DeepSeek-V4 series, leading to faster convergence and improved training stability.
To enable efficient training and inference for DeepSeek-V4 series as well as productive development, we introduce several infrastructure optimizations. First, we design and implement
a single fused kernel for MoE modules that fully overlaps computation, communication, and
memory access. Second, we employ TileLang (Wang et al., 2026), a Domain-Specific Language
(DSL) to balance development productivity and runtime efficiency. Third, we provide efficient
batch-invariant and deterministic kernel libraries to ensure bitwise reproducibility across training and inference. Fourth, we incorporate FP4 quantization-aware training for MoE expert
weights and the indexer QK path to reduce memory and computation. Fifth, for the training
framework, we extend the autograd framework with tensor-level checkpointing for fine-grained
recomputation control; and we enhance training efficiency with a hybrid ZeRO strategy for the
Muon optimizer, cost-effective mHC implementations via recomputation and fused kernels, and
two-stage contextual parallelism to manage compressed attention. Finally, for the inference
framework, we design a heterogeneous KV cache structure with on-disk storage strategies to
enable efficient shared-prefix reuse.


>cd /Users/kiran/Downloads/pdf2kindle && pdftotext -f 2 -l 3 /Users/kiran/Downloads/DeepSeek_V4.pdf -

Contents
1 Introduction
2 Architecture
  2.1 Designs Inherited from DeepSeek-V3
  2.2 Manifold-Constrained Hyper-Connections
  2.3 Hybrid Attention with CSA and HCA
    2.3.1 Compressed Sparse Attention
    2.3.2 Heavily Compressed Attention
    2.3.3 Other Details
    2.3.4 Efficiency Discussion
  2.4 Muon Optimizer
3 General Infrastructures
  3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
  3.2 Flexible and Efficient Kernel Development with TileLang
  3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
  3.4 FP4 Quantization-Aware Training
  3.5 Training Framework
    3.5.1 Efficient Implementation of Muon
    3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
    3.5.3 Contextual Parallelism for Long-Context Attention
    3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
  3.6 Inference Framework
    3.6.1 KV Cache Structure and Management
    3.6.2 On-Disk KV Cache Storage
4 Pre-Training
  4.1 Data Construction
  4.2 Pre-Training Setups
    4.2.1 Model Setups
    4.2.2 Training Setups
    4.2.3 Mitigating Training Instability
  4.3 Evaluations
    4.3.1 Evaluation Benchmarks
    4.3.2 Evaluation Results
5 Post-Training
  5.1 Post-Training Pipeline
    5.1.1 Specialist Training
    5.1.2 On-Policy Distillation
  5.2 RL and OPD Infrastructures
    5.2.1 FP4 Quantization Integration
    5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
    5.2.3 Preemptible and Fault-Tolerant Rollout Service
    5.2.4 Scaling RL Framework for Million-Token Context
    5.2.5 Sandbox Infrastructure for Agentic AI
  5.3 Standard Benchmark Evaluation
    5.3.1 Evaluation Setup
    5.3.2 Evaluation Results
  5.4 Performance on Real-World Tasks
    5.4.1 Chinese Writing
    5.4.2 Search
    5.4.3 White-Collar Task
    5.4.4 Code Agent
6 Conclusion, Limitations, and Future Directions
A Author List and Acknowledgment
  A.1 Author List
  A.2 Acknowledgment
B Evaluation Details

The table of contents and figure residue are cluttering the output. The cleanup needs four things: remove the redundant TOC from the body, since it is already in the metadata; strip figure labels and axis annotations; rebuild the headings from the paper's actual section structure; and escape any HTML-like syntax (the "think" element and the DSML tags in the EPUBCheck and pandoc output) that could break XHTML parsing.

For a 58-page LaTeX-generated PDF there are two options: manually reconstruct the semantics from the paper's TOC structure, or apply targeted transformations to the extracted text. The extraction is mostly solid, with problems confined to specific areas, so the better move is to re-extract the raw text and write a custom reconstruction script that uses the paper's TOC as a scaffold to rebuild the Markdown, stripping the TOC pages and figure noise along the way.
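
Stripping the figure noise could start from a simple observation in the raw text: plot residue shows up as runs of short numeric-only lines separated by blanks. The sketch below drops such runs. The function name, the pattern, and the minimum run length of three are assumptions for illustration. The heuristic would also remove a genuine column of numbers, so its output needs a spot-check.

```python
import re

# Lines made only of digits, dots, percent signs, and spaces.
NUMERIC_ONLY = re.compile(r"^[\d.\s%]+$")

def drop_numeric_noise(lines, min_run=3):
    """Drop runs of at least min_run numeric-only lines (blank lines ignored)."""
    out, buffer, numeric = [], [], 0

    def flush():
        nonlocal buffer, numeric
        if numeric < min_run:
            out.extend(buffer)   # too short to be plot residue: keep as-is
        elif out and out[-1].strip():
            out.append("")       # leave one blank line where the run was
        buffer, numeric = [], 0

    for line in lines:
        if not line.strip():
            buffer.append(line)
        elif NUMERIC_ONLY.match(line):
            buffer.append(line)
            numeric += 1
        else:
            flush()
            out.append(line)
    flush()
    return out
```

Short runs are deliberately kept, so an isolated number inside running text survives.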

>cd /Users/kiran/Downloads/pdf2kindle && pdftotext /Users/kiran/Downloads/DeepSeek_V4.pdf /tmp/dsv4_raw.txt && wc -l /tmp/dsv4_raw.txt && rg -n "^[0-9]\.[0-9].*\." /tmp/dsv4_raw.txt | head -30

4905 /tmp/dsv4_raw.txt
202:2.3.1
208:2.3.2
214:2.3.3
220:2.3.4
267:3.5.1
273:3.5.2
279:3.5.3
285:3.5.4
297:3.6.1
303:3.6.2
325:4.2.1
331:4.2.2
337:4.2.3
347:4.3.1
353:4.3.2
375:5.1.1
381:5.1.2
391:5.2.1
397:5.2.2
403:5.2.3
409:5.2.4
415:5.2.5
425:5.3.1
431:5.3.2
441:5.4.1
447:5.4.2
453:5.4.3
459:5.4.4
635:2.1. Designs Inherited from DeepSeek-V3
650:2.2. Manifold-Constrained Hyper-Connections

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^[0-9]+\.[0-9]+\.\s+[A-Z]" /tmp/dsv4_raw.txt | head -60

635:2.1. Designs Inherited from DeepSeek-V3
650:2.2. Manifold-Constrained Hyper-Connections
837:2.3. Hybrid Attention with CSA and HCA
1239:2.4. Muon Optimizer
1268:3.1. Fine-Grained Communication-Computation Overlap in Expert Parallelism
1393:3.2. Flexible and Efficient Kernel Development with TileLang
1451:3.3. High-Performance Batch-Invariant and Deterministic Kernel Libraries
1508:3.4. FP4 Quantization-Aware Training
1537:3.5. Training Framework
1630:3.6. Inference Framework
1812:4.1. Data Construction
1832:4.2. Pre-Training Setups
1940:4.3. Evaluations
2150:5.1. Post-Training Pipeline
2506:5.3. Standard Benchmark Evaluation
3178:5.4. Performance on Real-World Tasks

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^[0-9]+\.\s+[A-Z]" /tmp/dsv4_raw.txt | head -30

622:2. Architecture
1811:4. Pre-Training
1859:128. We employ MoE layers in all Transformer blocks, but use the Hash routing strategy for the
2149:5. Post-Training
3443:6. Conclusion, Limitations, and Future Directions
3545:07372. URL https://doi.org/10.48550/arXiv.2601.07372.
3587:2378. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1246. URL
3742:2021. URL https://proceedings.neurips.cc/paper/2021/hash/92bf5e6240737
3755:2019. URL http://arxiv.org/abs/1911.02150.
3771:2024. URL https://arxiv.org/abs/2407.11214.

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^1\. Introduction|^A\.|^A Author|^B Evaluation|Acknowledgment$|^A.[12]" /tmp/dsv4_raw.txt

475:A Author List and Acknowledgment
480:A.1 Author List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
484:A.2 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
488:B Evaluation Details
3481:A. Agache, M. Brooker, A. Florescu, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and
3529:A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
3530:A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage,
3539:A. Cheng, A. Jacovi, A. Globerson, B. Golan, C. Kwong, C. Alberti, C. Tao, E. Ben-David, G. S.
3716:A. Hutchison, A. Abdagic, A. Carl, A. Shen, A. Brock, A. Coenen, A. Laforge, A. Paterson,
3772:A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30,
3841:A. Author List and Acknowledgment
3842:A.1. Author List
3892:A.2. Acknowledgment

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^B\.|^B Evaluation Details|^B\. Evaluation" /tmp/dsv4_raw.txt | head -10

488:B Evaluation Details
3527:B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet,
3618:B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko.
3714:B. Shahriari, A. Ram’e, J. Ferret, P. Liu, P. D. Tafti, A. Friesen, M. Casbon, S. Ramos, R. Kumar,
3717:B. Bastian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar, C. Perry, C. A. Welty, C. A. ChoquetteChoo, D. Sinopalnikov, D. Weinberger, D. Vijaykumar, D. Rogozi’nska, D. Herbison, E. Bandy,
3744:B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea,
3796:B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou,
3896:B. Evaluation Details

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '3895,3905p' /tmp/dsv4_raw.txt

B. Evaluation Details
Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro.
Difficulty Category

#

Agent Win RAG Win Tie Agent% RAG% Tie%

Easy

I'm settling on a concrete approach: a Python script that turns the raw pdftotext output into clean markdown. I'll run pdftotext without layout preservation so the reading order stays correct, skip the table of contents pages, start from the introduction, and use regex patterns to detect and format the major section headings.

The script will also handle cleanup tasks like removing page numbers, reflowing wrapped paragraphs, fixing hyphenated words split across lines, escaping XML-like tags that might confuse pandoc, stripping repeated headers and footers, and carefully removing orphaned figure caption fragments while preserving actual captions. I need to verify exactly how the section numbers and titles are laid out in the raw text before finalizing the regex patterns.
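A minimal sketch of the heading and reflow passes described above. The names `heading_to_markdown` and `reflow`, and the `max_section` cutoff, are illustrative assumptions rather than the final build script. One further assumption: the anchored searches so far never surfaced `1. Introduction` or a top-level `3.` heading, which would be consistent with pdftotext's form-feed page separator sitting at the start of the first line of a page, so the sketch strips it before matching.

```python
import re
from typing import List, Optional

# Numbered headings as seen in the raw text, e.g. "2. Architecture",
# "2.1. Designs Inherited from DeepSeek-V3", "2.3.1. Compressed Sparse Attention".
HEADING_RE = re.compile(r"^(\d+(?:\.\d+){0,2})\.\s+([A-Z].*)$")
PAGE_NUMBER_RE = re.compile(r"\d{1,3}")


def heading_to_markdown(line: str, max_section: int = 20) -> Optional[str]:
    """Return a markdown heading for a numbered section line, else None."""
    # strip() also removes a leading form feed, which would defeat the ^ anchor.
    m = HEADING_RE.match(line.strip())
    if not m:
        return None
    number, title = m.groups()
    # Reject prose and references that merely start with a number,
    # such as "128. We employ MoE layers ..." or "2019. URL http://...".
    if int(number.split(".")[0]) > max_section:
        return None
    level = number.count(".") + 2  # "2" -> h2, "2.1" -> h3, "2.3.1" -> h4
    return f"{'#' * level} {number}. {title}"


def reflow(lines: List[str]) -> List[str]:
    """Join wrapped lines into paragraphs and drop bare page numbers."""
    paragraphs: List[str] = []
    buf: List[str] = []
    skip_blank = False  # True right after a page number was dropped
    for raw in lines:
        line = raw.strip()
        if PAGE_NUMBER_RE.fullmatch(line):
            skip_blank = True
            continue
        if not line:
            # A blank line directly after a page number is a page break,
            # not a paragraph break, so the paragraph keeps going.
            if not skip_blank and buf:
                paragraphs.append(" ".join(buf))
                buf = []
            skip_blank = False
            continue
        skip_blank = False
        heading = heading_to_markdown(line)
        if heading:
            if buf:
                paragraphs.append(" ".join(buf))
                buf = []
            paragraphs.append(heading)
        else:
            buf.append(line)
    if buf:
        paragraphs.append(" ".join(buf))
    return paragraphs
```

The page-number rule is deliberately blunt: it would also drop a legitimate one- to three-digit line inside a table, so tables need to be handled before this pass or excluded from it. Appendix headings such as `A.1. Author List` are not covered here, because a letter prefix also matches reference entries like `A. Agache, ...`.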

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '618,665p' /tmp/dsv4_raw.txt

achieves comparable results on reasoning tasks when allocated a larger thinking budget. In agent evaluations, while DeepSeek-V4-Flash-Max matches the performance of
DeepSeek-V4-Pro-Max on several benchmarks, it still trails its larger counterpart on more
complex, high-difficulty tasks.

2. Architecture
Overall, DeepSeek-V4 series retain the Transformer (Vaswani et al., 2017) architecture and MultiToken Prediction (MTP) modules (DeepSeek-AI, 2024; Gloeckle et al., 2024), while introducing
several key upgrades over DeepSeek-V3: (1) firstly, we introduce the Manifold-Constrained
Hyper-Connections (mHC) (Xie et al., 2026) to strengthen conventional residual connections;
6

(2) secondly, we design a hybrid attention architecture, which greatly improves long-context
efficiency through Compressed Sparse Attention and Heavily Compressed Attention. (3) thirdly,
we employ Muon (Jordan et al., 2024; Liu et al., 2025) as the optimizer. For the Mixture-ofExperts (MoE) components, we still adopt the DeepSeekMoE (Dai et al., 2024) architecture, with
only minor adjustments from DeepSeek-V3. The Multi-Token Prediction (MTP) (DeepSeek-AI,
2024; Gloeckle et al., 2024; Li et al., 2024; Qi et al., 2020) configuration remains identical to
that of DeepSeek-V3. All other unspecified details follow the settings established in DeepSeekV3 (DeepSeek-AI, 2024). Figure 2 illustrates the overall architecture of DeepSeek-V4, and the
details are described below.
2.1. Designs Inherited from DeepSeek-V3
Mixture-of-Experts. As previous DeepSeek-series models (DeepSeek-AI, 2024; DeepSeek-AI,
2024), DeepSeek-V4 series also adopt the DeepSeekMoE paradigm (Dai et al., 2024) for FeedForward Networks (FFNs), which sets fine-grained routed experts and shared experts. Different
from DeepSeek-V3, we change the activation function that computes the affinity scores from
Sigmoid(·) into Sqrt(Softplus(·)). For load balancing, we also employ the auxiliary-loss-free
strategy (DeepSeek-AI, 2024; Wang et al., 2024a), augmented by a slight sequence-wise balance
loss that prevents extreme imbalance within individual sequences. For DeepSeek-V4, we remove
the constraint on the number of routing target nodes, and carefully redesign the parallelism
strategy to maintain training efficiency. Furthermore, compared with DeepSeek-V3, we replace
the dense FFN layers in the initial several Transformer blocks with MoE layers that employ
Hash routing (Roller et al., 2021). The Hash routing strategy determines the target experts of
each token according to a predefined hash function with regard to the input token ID.
Multi-Token Prediction. As DeepSeek-V3, DeepSeek-V4 series also set MTP modules and
objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same
strategy for DeepSeek-V4 series without modification.
2.2. Manifold-Constrained Hyper-Connections
As shown in Figure 2, DeepSeek-V4 series incorporate Manifold-Constrained Hyper-Connections
(mHC) (Xie et al., 2026) to strengthen the conventional residual connections between adjacent
Transformer blocks. Compared with naive Hyper-Connections (HC) (Zhu et al., 2025), the core
idea of mHC is to constrain the residual mapping onto a specific manifold, and thus enhance the
stability of signal propagation across layers while preserving model expressivity. This subsection
briefly introduces the standard HC and describes how we design mHC for stable training.
Standard Hyper-Connections. The standard HC expands the width of the residual stream
by a factor of 𝑛hc . Specifically, the shape of the residual stream is expanded from R𝑑 to R𝑛hc × 𝑑 ,
where 𝑑 is the hidden size of the actual layer input. Let 𝑋𝑙 = [x𝑙,1 ; . . . ; x𝑙,𝑛hc ] 𝑇 ∈ R𝑛hc × 𝑑 be the
residual state before the 𝑙-th layer. HC introduces three linear mappings: an input mapping
𝐴𝑙 ∈ R1× 𝑛hc , a residual transformation 𝐵𝑙 ∈ R𝑛hc × 𝑛hc , and an output mapping 𝐶 𝑙 ∈ R𝑛hc ×1 . The
update of the residual state is then formulated as:
𝑋𝑙+1 = 𝐵𝑙 𝑋𝑙 + 𝐶 𝑙 F𝑙 ( 𝐴𝑙 𝑋𝑙 ),

(1)

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^[0-9]+\.[0-9]+\.[0-9]+\." /tmp/dsv4_raw.txt | head -25

45:80.680.880.6
61:37.740.039.8
849:2.3.1. Compressed Sparse Attention
1078:2.3.2. Heavily Compressed Attention
1153:2.3.3. Other Details
1195:2.3.4. Efficiency Discussion
1543:3.5.1. Efficient Implementation of Muon
1586:3.5.3. Contextual Parallelism for Long-Context Attention
1604:3.5.4. Extended Automatic Differentiation for Flexible Activation Checkpointing
1633:3.6.1. KV Cache Structure and Management
1774:3.6.2. On-Disk KV Cache Storage
1833:4.2.1. Model Setups
1866:4.2.2. Training Setups
1903:4.2.3. Mitigating Training Instability
1941:4.3.1. Evaluation Benchmarks
2127:4.3.2. Evaluation Results
2154:5.1.1. Specialist Training
2306:5.1.2. On-Policy Distillation
2407:5.2.1. FP4 Quantization Integration
2414:5.2.2. Efficient Teacher Scheduling for Full-Vocabulary OPD
2433:5.2.3. Preemptible and Fault-Tolerant Rollout Service
2453:5.2.4. Scaling RL Framework for Million-Token Context
2463:5.2.5. Sandbox Infrastructure for Agentic AI
2507:5.3.1. Evaluation Setup
3185:5.4.1. Chinese Writing

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^3\.5\.2|Cost-Effective and Memory-Efficient" /tmp/dsv4_raw.txt | head -5

273:3.5.2
275:Cost-Effective and Memory-Efficient Implementation of mHC . . . . . . .
1573:3.5.2. Cost-Effective and Memory-Efficient Implementation of mHC

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^5\.4\.[0-9]|^5\.3\.2" /tmp/dsv4_raw.txt | head -8

431:5.3.2
441:5.4.1
447:5.4.2
453:5.4.3
459:5.4.4
3185:5.4.1. Chinese Writing
3226:5.4.3. White-Collar Task

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "Search$|Code Agent|White-Collar" /tmp/dsv4_raw.txt | head -10 && rg -n "^5\.3\.2|Evaluation Results" /tmp/dsv4_raw.txt | head -10

455:White-Collar Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
461:Code Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3207:5.4.2. Search
3226:5.4.3. White-Collar Task
3397:5.4.4. Code Agent
3985:V4 Agentic Search
3986:V4 Retrieval Augmented Search
4002:Search
355:Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
431:5.3.2
433:Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
569:Summary of Core Evaluation Results
2127:4.3.2. Evaluation Results
2570:5.3.2. Evaluation Results

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '485,500p' /tmp/dsv4_raw.txt

55

B Evaluation Details

55

3

1. Introduction
The emergence of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024c) has established a
new paradigm of test-time scaling, driving substantial performance gains for Large Language
Models (LLMs). However, this scaling paradigm is fundamentally constrained by the quadratic
computational complexity of the vanilla attention mechanism (Vaswani et al., 2017), which
creates a prohibitive bottleneck for ultra-long contexts and reasoning processes. Concurrently,
the emergence of long-horizon scenarios and tasks — from complex agentic workflows to

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '1,180p' /tmp/dsv4_raw.txt

DeepSeek-V4:
Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-AI
[REDACTED]

Abstract
We present a preview version of DeepSeek-V4 series, including two strong Mixture-ofExperts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and
DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of
one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA)
and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) ManifoldConstrained Hyper-Connections (mHC) that enhance conventional residual connections; (3)
and the Muon optimizer for faster convergence and greater training stability. We pre-train
both models on more than 32T diverse and high-quality tokens, followed by a comprehensive
post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-ProMax, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for
open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are
highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeekV4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared
with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby
making long-horizon tasks and further test-time scaling more feasible. The model checkpoints
are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

DeepSeek-V4-Pro-Max

Claude-Opus-4.6-Max
90.2

89.1

85.9

Accuracy / Pass@1 (%)

80

78.1

75.6

GPT-5.4-xHigh

Gemini-3.1-Pro-High

32063168

3052

80.680.880.6
75.1

67.9

65.4

68.5

60 57.9

54.6

51.8
46.2 45.3

37.740.039.8

40

47.2

44.4

48.8

0.8
0.6

0.2
0.0

0

(Pass@1)

HLE

(Pass@1)

Apex Codeforces SWE
Shortlist
Verified
(Rating)
(Pass@1)

Knowledge & Reasoning

(Resolved)

Terminal
Bench 2.0
(Acc)

Toolathlon
(Pass@1)

Agentic Capabilities

256

512

768

Token Position (K)

1024

DeepSeek-V3.2
DeepSeek-V4-Pro
DeepSeek-V4-Flash

40

SimpleQA
Verified

3.7× lower
9.8× lower

0.4

50

20
0

DeepSeek-V3.2
DeepSeek-V4-Pro
DeepSeek-V4-Flash

1.0

Accumulated KV Cache (GB)

100

Single-Token FLOPs (T)

1.2

30

9.5× smaller

20

13.7× smaller

10
0

0

256

512

768

Sequence Length (K)

1024

Figure 1 | Left: benchmark performance of DeepSeek-V4-Pro-Max and its counterparts. Right:
inference FLOPs and KV cache size of DeepSeek-V4 series and DeepSeek-V3.2.

Contents
1

Introduction

4

2 Architecture

The document structure is clear now: the body starts around line 494 with the introduction, and everything before that is front matter like the title, abstract, and table of contents. Figures leave a recognizable pattern in the extraction: the caption sits next to the chart's raw text, and for Figure 1 the bar values, axis ticks, and legend entries come just before the caption line. The captions themselves carry important information, so I should preserve them while filtering out the raw axis fragments and chart residue around them.
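A rough sketch of those two observations, under stated assumptions: `find_body_start` and `is_numeric_residue` are hypothetical helper names, and the numeric-only test is just one conservative heuristic for chart residue, not a complete filter.

```python
import re
from typing import List

# Lines made only of digits, dots, and spaces, with at least one digit.
# Matches chart residue such as "90.2", "80.680.880.6", or "46.2 45.3".
NUMERIC_ONLY_RE = re.compile(r"[\d.\s]*\d[\d.\s]*")


def find_body_start(lines: List[str], marker: str = "1. Introduction") -> int:
    """Index of the first body line, or 0 if the marker is not found."""
    for i, line in enumerate(lines):
        # strip() also removes a leading form feed left by a page break.
        if line.strip() == marker:
            return i
    return 0


def is_numeric_residue(line: str) -> bool:
    """True for lines that contain nothing but numbers, dots, and spaces."""
    return NUMERIC_ONLY_RE.fullmatch(line.strip()) is not None
```

Text labels such as `HLE` or `(Pass@1)` pass through this check untouched, which is the conservative direction. The check also matches page numbers and numeric table cells, so it is only safe to apply inside a region already identified as a figure.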

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^Figure [0-9]+ \||^Table [0-9]+ \|" /tmp/dsv4_raw.txt | head -20

170:Figure 1 | Left: benchmark performance of DeepSeek-V4-Pro-Max and its counterparts. Right:
605:Figure 2 | Overall architecture of DeepSeek-V4 series. We use hybrid CSA (Compressed Sparse
833:Figure 3 | Core architectures of CSA. It compresses the number of KV entries to 𝑚1 times, and
1037:Figure 4 | Core architectures of HCA. It performs heavier compression, where the KV entries of
1340:Figure 5 | Illustration of our EP scheme with related works. Comet (Zhang et al., 2025b) overlaps
1748:Figure 6 | Illustration of the KV cache Layout for DeepSeek-V4. The KV cache is organized into
2227:Table 3 | Instruction injected into the system prompt for the "Think Max" mode.
2295:Figure 7 | Thinking management of DeepSeek-V4 series.
2571:Table 6 | Comparison between DeepSeek-V4-Pro-Max and closed/open source models. "Max",
3017:Figure 8 | Formal reasoning under practical and frontier regimes. Left: Putnam-200 Pass@8
3081:Figure 9 | DeepSeek-V4 series performance on the MRCR task.
3175:Figure 10 | HLE and Terminal Bench 2.0 performance by reasoning effort. “None” indicates
3374:Figure 11 | Win-rate comparison across analysis, generation, editing tasks, and the overall
3390:Figure 12 | Detailed dimension scores including Task Completion, Content Quality, Formatting Aesthetics, and Instruction Following.
3392:Figure 13 | Example output of a task which requires drafting a joint marketing proposal for a
3405:Table 8 | Comparison on R&D Coding Benchmark (external models included strictly for evaluation purposes).
3897:Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro.
3966:Table 10 | Cost Comparison:Agentic Search vs. Retrieval Augmented Search (Mean) for
3988:Table 11 | Comparative Evaluation of DeepSeek-V4-Pro and DeepSeek-V3.2 on Search Q&A
4143:Figure 15 | Example output of a task which requires researching 2020-2025 Nobel Science Prizes

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "Table [0-9]+ \||^Table [0-9]+\b" /tmp/dsv4_raw.txt | head -20

1958:Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4Pro-Base. All models are evaluated in our internal framework and share the same evaluation
2184:Table 2 | Comparison of three reasoning modes
2227:Table 3 | Instruction injected into the system prompt for the "Think Max" mode.
2249:Table 4 | Tool-call schema for DeepSeek-V4 series.
2315:Table 5 | Quick Instruction special tokens for auxiliary tasks.
2571:Table 6 | Comparison between DeepSeek-V4-Pro-Max and closed/open source models. "Max",
2769:Table 7 | Comparison among different sizes and modes of DeepSeek-V4 series. "Non-Think",
3195:Table 13 presents the creative writing comparison, which is evaluated along two axes:
3212:Table 11, DeepSeek-V4-Pro outperforms DeepSeek-V3.2 by a substantial margin, demonstrating
3405:Table 8 | Comparison on R&D Coding Benchmark (external models included strictly for evaluation purposes).
3897:Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro.
3966:Table 10 | Cost Comparison:Agentic Search vs. Retrieval Augmented Search (Mean) for
3988:Table 11 | Comparative Evaluation of DeepSeek-V4-Pro and DeepSeek-V3.2 on Search Q&A
4148:Table 12 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Functional
4600:Table 13 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Creative
4865:Table 14 | DeepSeek-V4-Pro vs. Claude-Opus-4.5 on Complex Instruction Following and MultiTurn Writing.

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '2225,2310p' /tmp/dsv4_raw.txt

summary

Table 3 | Instruction injected into the system prompt for the "Think Max" mode.
Injected Instruction
Reasoning Effort: Absolute maximum with no shortcuts permitted.
You MUST be very thorough in your thinking and comprehensively decompose the
problem to resolve the root cause, rigorously stress-testing your logic against all potential
paths, edge cases, and adversarial scenarios.
Explicitly write out your entire deliberation process, documenting every intermediate
step, considered alternative, and rejected hypothesis to ensure absolutely no assumption
is left unchecked.

model leverages its own logic to generalize across complex tasks.
Tool-Call Schema and Special Token. Consistent with our previous version, we utilize a
dedicated <think></think> tag to delineate the reasoning path. In DeepSeek-V4 series, we
introduce a new tool-call schema that employs a special "|DSML|" token and utilizes an XMLbased format for tool invocations, as demonstrated in Table 4. Our experiments demonstrate that
the XML format effectively mitigates escaping failures and reduces tool-call errors, providing a
more robust interface for model-tool interactions.
Interleaved Thinking. DeepSeek-V3.2 introduced a context management strategy that retains
reasoning traces across tool-result rounds but discards them upon the arrival of new user messages. While effective, this still caused unnecessary token waste in complex agentic workflows
— each new user turn would flush all accumulated reasoning content, forcing the model to
reconstruct its problem-solving state from scratch. Leveraging the expanded 1M-token context
30

Table 4 | Tool-call schema for DeepSeek-V4 series.
Tool Call Schema

## Tools
You have access to a set of tools to help answer the user’s question. You can
invoke tools by writing a "<|DSML|tool_calls>" block like the following:
<|DSML|tool_calls>
<|DSML|invoke name="$TOOL_NAME">
<|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE
</|DSML|parameter>
...
</|DSML|invoke>
<|DSML|invoke name="$TOOL_NAME2">
...
</|DSML|invoke>
</|DSML|tool_calls>
String parameters should be specified as is and set ‘string="true"‘. For all
other types (numbers, booleans, arrays, objects), pass the value in JSON
format and set ‘string="false"‘.
If thinking_mode is enabled (triggered by <think>), you MUST output your
complete reasoning inside <think>...</think> BEFORE any tool calls or
final response.
Otherwise, output directly after </think> with tool calls or final response.
### Available Tool Schemas
{Tool Definition...}
You MUST strictly follow the above definedtool name and parameter schemas to
invoke tool calls.

window of DeepSeek-V4 series, we further refine this mechanism to maximize the effectiveness
of interleaved thinking in agentic environments:
• Tool-Calling Scenarios. As illustrated in Figure 7(a), all reasoning content is fully preserved throughout the entire conversation. Unlike DeepSeek-V3.2, which discarded
thinking traces upon each new user turn, DeepSeek-V4 series retain the complete reasoning history across all rounds, including across user message boundaries. This allows the
model to maintain a coherent, cumulative chain of thought over long-horizon agent tasks.
• General Conversational Scenarios. As illustrated in Figure 7(b), the original strategy is
preserved: reasoning content from previous turns is discarded when a new user message
arrives, keeping the context concise for settings where persistent reasoning traces provide
limited benefit.
As with DeepSeek-V3.2, agent frameworks that simulate tool interactions via user messages (e.g.,
Terminus) may not trigger the tool-calling context path and thus may not benefit from enhanced
reasoning persistence. We continue to recommend non-think models for such architectures.
31

a) Thinking with tools

b) Thinking without tools

Figure 7 | Thinking management of DeepSeek-V4 series.
Quick Instruction. In chatbot scenarios, a number of auxiliary tasks (e.g., determining whether
to trigger a web search, intent recognition, etc.) must be executed before generating the response.
Conventionally, these tasks are handled by a separate small model, requiring redundant prefilling since it cannot reuse the existing KV cache. To overcome this limitation, we introduce Quick
Instruction. We append a set of dedicated special tokens directly to the input sequence, where
each token corresponds to a specific auxiliary task. By directly reusing the already-computed
KV cache, this mechanism completely avoids redundant prefilling and allows certain tasks, such
as generating search queries and determining authority and domain, to be executed in parallel.
Consequently, this approach significantly reduces the user-perceived time-to-first-token (TTFT)
and eliminates the engineering overhead of maintaining and iterating an extra small model. The
supported Quick Instruction tokens are summarized in Table 5.
5.1.2. On-Policy Distillation
After training multiple domain-specific experts via specialized fine-tuning and reinforcement
learning, we employ multi-teacher On-Policy Distillation (OPD) as the primary technique for
merging expert capabilities into the final model. OPD has emerged as an effective post-training
paradigm for efficiently transferring the knowledge and capabilities of domain experts to a

Those markup tags are literal text examples in the paper, so they need to go inside a code block where pandoc won't interpret them as markup. I'm going to write a custom build script that reads the raw PDF text output and processes it carefully, starting by skipping past the title and abstract sections.

The script will walk through the document detecting heading patterns and converting them to markdown, joining wrapped lines within paragraphs, removing page numbers, and identifying code-block content like the instruction text and tool-call schemas to wrap in fenced blocks. I'll also need to escape any raw <think> tags or <|DSML|...> tokens that appear in regular prose by wrapping them in backticks. For the references section at the end, I should check whether it's worth preserving as-is since it's a standard bibliography.
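A minimal sketch of the backtick escaping for prose, assuming the only tag-like tokens that matter are `<think>`, `</think>`, and the `<|DSML|...>` family seen in the extracted text. `protect_tags` is a hypothetical helper name.

```python
import re

# One or more adjacent literal tags: <think>, </think>, <|DSML|tool_calls>,
# </|DSML|invoke>, ... Adjacent tags are matched as one run so that
# "<think></think>" becomes a single code span instead of two touching ones.
TAG_RUN_RE = re.compile(r"(?:</?(?:think|\|DSML\|[^<>]*)>)+")
CODE_SPAN_RE = re.compile(r"(`[^`]*`)")


def protect_tags(text: str) -> str:
    """Wrap literal tag-like tokens in backticks, leaving existing code spans alone."""
    parts = CODE_SPAN_RE.split(text)
    # With one capturing group, odd indices hold the existing code spans.
    for i in range(0, len(parts), 2):
        parts[i] = TAG_RUN_RE.sub(lambda m: f"`{m.group(0)}`", parts[i])
    return "".join(parts)
```

This is meant for running prose only. Blocks such as the tool-call schema are better wrapped whole in a fenced block, since line-by-line backticks would fragment them.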

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '3470,3530p' /tmp/dsv4_raw.txt

will continue to iterate and explore in this direction. We are also working on incorporating
multimodal capabilities to our models. Finally, we are committed to developing better data
curation and synthesis strategies to consistently enhance model intelligence, robustness, and
practical usability across an increasingly broad range of scenarios and tasks.

References
AA. Gdpval-aa leaderboard, 2025. URL https://artificialanalysis.ai/methodolog
y/intelligence-benchmarking#gdpval-aa.
T. Achim, A. Best, A. Bietti, K. Der, M. Fédérico, S. Gukov, D. Halpern-Leistner, K. Henningsgard,
Y. Kudryashov, A. Meiburg, et al. Aristotle: Imo-level automated theorem proving. arXiv
preprint arXiv:2510.01346, 2025.
A. Agache, M. Brooker, A. Florescu, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and
D.-M. Popa. Firecracker: lightweight virtualization for serverless applications. In Proceedings
of the 17th Usenix Conference on Networked Systems Design and Implementation, NSDI’20,
page 419–434, USA, 2020. USENIX Association. ISBN 9781939133137.
O. J. Aimuyo, B. Oh, and R. Singh. Flashmoe: Fast distributed moe in a single kernel. Advances
in Neural Information Processing Systems, 2025. URL https://neurips.cc/virtual/2
025/poster/119124.
J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training
generalized multi-query transformer models from multi-head checkpoints. arXiv preprint
arXiv:2305.13245, 2023.
J. Asher. LeanExplore: A search engine for Lean 4 declarations, 2025. URL https://arxiv.or
g/abs/2506.11085.
Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, Z. Chen, J. Cui,
H. Ding, M. Dong, A. Du, C. Du, D. Du, Y. Du, Y. Fan, Y. Feng, K. Fu, B. Gao, H. Gao, P. Gao,
T. Gao, X. Gu, L. Guan, H. Guo, J. Guo, H. Hu, X. Hao, T. He, W. He, W. He, C. Hong, Y. Hu,
Z. Hu, W. Huang, Z. Huang, Z. Huang, T. Jiang, Z. Jiang, X. Jin, Y. Kang, G. Lai, C. Li, F. Li,
H. Li, M. Li, W. Li, Y. Li, Y. Li, Z. Li, Z. Li, H. Lin, X. Lin, Z. Lin, C. Liu, C. Liu, H. Liu, J. Liu,
J. Liu, L. Liu, S. Liu, T. Y. Liu, T. Liu, W. Liu, Y. Liu, Y. Liu, Y. Liu, Y. Liu, Z. Liu, E. Lu, L. Lu,
S. Ma, X. Ma, Y. Ma, S. Mao, J. Mei, X. Men, Y. Miao, S. Pan, Y. Peng, R. Qin, B. Qu, Z. Shang,
L. Shi, S. Shi, F. Song, J. Su, Z. Su, X. Sun, F. Sung, H. Tang, J. Tao, Q. Teng, C. Wang, D. Wang,
F. Wang, and H. Wang. Kimi K2: open agentic intelligence. CoRR, abs/2507.20534, 2025a.
URL https://doi.org/10.48550/arXiv.2507.20534.
Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, et al. Longbench
v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pages 3639–3664, 2025b.

45

M. Balunović, J. Dekoninck, I. Petrov, N. Jovanović, and M. Vechev. Matharena: Evaluating llms
on uncontaminated math competitions. Proceedings of the Neural Information Processing
Systems Track on Datasets and Benchmark, 2025.
C. Bandi, B. Hertzberg, G. Boo, T. Polakam, J. Da, S. Hassaan, M. Sharma, A. Park, E. Hernandez,
D. Rambado, et al. Mcp-atlas: A large-scale benchmark for tool-use competency with real
mcp servers. arXiv preprint arXiv:2602.00933, 2026.
F. Bellard. Qemu, a fast and portable dynamic translator. In Proceedings of the Annual
Conference on USENIX Annual Technical Conference, ATEC ’05, page 41, USA, 2005. USENIX
Association.
I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with
reinforcement learning, 2017. URL https://openreview.net/forum?id=rJY3vK9eg.
J. Chen, W. Chen, J. Du, J. Hu, Z. Jiang, A. Jie, X. Jin, X. Jin, C. Li, W. Shi, Z. Wang, M. Wang,
C. Wei, S. Wei, H. Xin, F. Yang, W. Gao, Z. Yuan, T. Zhan, Z. Zheng, T. Zhou, and T. H.
Zhu. Seed-prover 1.5: Mastering undergraduate-level theorem proving via learning from
experience, 2025. URL https://arxiv.org/abs/2512.17260.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin,
B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet,
F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss,
A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage,

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '3838,3900p' /tmp/dsv4_raw.txt

53

Appendix
A. Author List and Acknowledgment
A.1. Author List
Authors are listed alphabetically by their first name. Names marked with * denote individuals
who have departed from our team.
Research & Engineering: Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang*, Bingzheng Xu,
Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi
Deng, Chenhao Xu, Chenze Shao, Chong Ruan*, Conner Sun, Damai Dai, Daya Guo*, Dejian
Yang, Deli Chen, Donghao Li, Erhang Li, Fangyun Lin, Fangzhou Yuan, Feiyu Xia, Fucong
Dai, Guangbo Hao, Guanting Chen, Guoai Cao, Guolai Meng, Guowei Li, Han Yu, Han Zhang,
Hanwei Xu, Hao Li, Haofen Liang, Haoling Zhang, Haoming Luo, Haoran Wei*, Haotian Yuan,
Haowei Zhang*, Haowen Luo, Haoyu Chen, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi
Cao, Huazuo Gao, Hui Qu, Hui Zeng, J. Yang, J.Q. Zhu, Jia Yu, Jialiang Huang, Jiasheng Ye,
Jiashi Li, Jiaxin Xu, Jiewen Hu, Jin Yan, Jingchang Chen, Jingli Zhou, Jingting Xiang, Jingyang
Yuan, Jingyuan Cheng, Jinhua Zhu, Jiping Yu, Joseph Sun, Jun Ran*, Junguang Jiang, Junjie Qiu,
Junlong Li*, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexing Zhou, Kezhao Huang*,
Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Li Zhang, Liang Zhao, Lihua Guo, Lingxiao
Luo, Linwang Ma, Litong Wang, Liyu Cai, Liyue Zhang, Longhao Chen, M.S. Di, M.Y Xu,
Max Mei, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang,
Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du,
Qiwei Jiang, Rui Tian, Ruifan Xu, Ruijie Lu, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan,
Runji Wang, Runqian Chen, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S.H. Liu,
Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoheng Nie, Shaoyuan Chen,
Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng
Zhou, Shuting Pan, Shuying Yu, Songyang Zhou, Tao Ni, Tao Yun, Tian Jin, Tian Pei, Tian Ye,
Tianle Lin, Tianran Ji, Tianyi Cui, Tianyuan Yue, Tingting Yu, Tun Wang, W. Zhang, Wangding
Zeng, Weilin Zhao, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjing Yao, Wenjun
Gao, Wenkai Yang, Wenlve Huang, Wentao Zhang, Wenting Ma, Xi Gao, Xiang He, Xiangwen
Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie,
Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingchen Liu, Xingkai Yu, Xingyou Li, Xinyu Yang,
Xu Chen, Xuanyu Wang, Xuecheng Su, Xuheng Lin, Xuwei Fu, Y.C. Yan, Y.Q. Wang*, Y.W. Ma,
Yanfeng Luo, Yang Zhang, Yanhong Xu, Yanru Ma, Yanwen Huang, Yao Li, Yao Li, Yao Zhao,
Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yijia Wu, Yiliang
Xiong, Ying He, Ying Zhou, Yingjia Luo, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiang Zhang,
Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo,
Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuanhao Li, Yuduan Wang, Yuhan
Wu, Yuhao Meng, Yuheng Zou, YuKun Li, Yunfan Xiong, Yupeng Chen, Yuqian Cao, Yuqian
Wang, Yushun Zhang, Yutong Lin, Yuxian Gu, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuxuan
Zhou, Yuyang Zhou, Yuzhen Huang, Z.F. Wu, Zehao Wang, Zehua Zhao, Zehui Ren, Zhangli
Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma,
Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhixuan Chen, Zhiyu Wu, Zhizhou Ren, Zhuoshu
Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang*, Ziwei
Xie, Ziyi Gao, Zizheng Pan, Zongqing Yao.
Business & Compliance: Chenchen Ling, Chengyu Hou, Dongjie Ji, Fang Wei, Hengqing Zhang,
Jia Luo, Jia Song, Jialu Cai, Jian Liang, Jiangting Zhou, Jieyu Yang, Jin Chen, Jingzi Zhou, Junmin
Zheng, Leyi Xia, Linyan Zhu, Miaojun Wang, Mingming Li, Minmin Han, Ning Wang, Panpan
54

Wang, Peng Zhang, Ruyi Chen, Shangmian Sun, Shaoqing Wu, W.L. Xiao, Wei An, Wenqing
Hou, Xianzu Wang, Xiaowen Sun, Xiaoxiang Wang, Xinyu Zhang, Xueyin Chen, Yao Xu, Yi
Shao, Yiling Ma, Ying Tang, Yuehan Yang, Yuer Xu, Yukun Zha, Yuping Lin, Yuting Yan, Zekai
Zhang, Zhe Ju, Zheren Gao, Zhongyu Wu, Zihua Qu, Ziyi Wan.
A.2. Acknowledgment
We would like to thank Dolly Deng and other testers for their valuable suggestions and feedback
regarding the capabilities of DeepSeek-V4 series models.

B. Evaluation Details
Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro.
Difficulty Category

#

Now I'm mapping out the build script structure—I'll read the raw PDF text output, skip the initial boilerplate up to the introduction, then process each line to rejoin wrapped paragraphs and detect heading patterns so I can emit proper markdown formatting while filtering out extraneous content.

I need to handle several special cases: drop standalone page numbers, preserve figure and table captions as italicized text, strip orphan label-like lines conservatively, and wrap Tables 3 and 4 in code fences to protect their XML-like structure. For the <think> tags and similar constructs scattered through the prose, I'll escape them by wrapping in backticks to prevent XML parsing issues.
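A minimal sketch of the backtick-escaping step, assuming a single regex is enough to recognise the tag-like tokens seen in this paper. The names `TAG_RE` and `escape_tags` are hypothetical and not taken from the actual build script.

```python
import re

# Assumed pattern: matches tokens such as <think>, </think>,
# <|DSML|invoke name="..."> and </|DSML|invoke>. It requires a letter
# right after the opening "<" (or "</", "<|", "</|"), so "a < b" is left alone.
TAG_RE = re.compile(r"</?\|?[A-Za-z][^<>\n]*>")


def escape_tags(text: str) -> str:
    """Wrap each XML-like token in backticks so it is emitted as literal text."""
    return TAG_RE.sub(lambda m: f"`{m.group(0)}`", text)


print(escape_tags("output your reasoning inside <think>...</think> first"))
```

This sketch is not idempotent: running it twice would wrap the tokens in a second pair of backticks. It would also misfire on prose like "x<y and z>w", so it should run once, on prose outside code fences only.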

References can stay mostly as-is but I'll need to rejoin any URLs that got split across lines, and the appendix author list is fine to keep. The Unicode mathematical characters like 𝑍 and 𝐻 ·𝑊 should render fine without special handling—Pandoc won't misinterpret them as TeX math.

The real issue is stray $ characters triggering math mode and causing Pandoc to choke on sequences like $...$ that contain XML-like syntax. I need to search for and escape those dollar signs.

>cd /Users/kiran/Downloads/pdf2kindle && rg -n '\$' /tmp/dsv4_raw.txt

2256:<|DSML|invoke name="$TOOL_NAME">
2257:<|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE
2261:<|DSML|invoke name="$TOOL_NAME2">

I see that Table 4 has dollar signs in the schema that need to be wrapped in a code fence to avoid parsing issues. Table 3 looks fine with just plain instruction text. Now I'm moving on to draft the build script that will process the raw PDF text output, walking through the lines and emitting the formatted content.

I'm setting up heading detection patterns that anchor to the start of each line—single-level headings like "Introduction" and "Architecture," multi-level numbered sections, lettered appendices, and special sections like References. Page numbers that appear as standalone integers get dropped, while figure and table captions get formatted as italicized blocks.
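The line classification described above could be sketched roughly as follows. The regexes and the name `classify` are illustrative assumptions. In particular, the sketch assumes numbered headings in the raw text end with a trailing dot (as in "A.1. Author List") and that page numbers are bare integers of at most three digits.

```python
import re

# All patterns are anchored to the start of the line.
NUMBERED_RE = re.compile(r"^(\d+(?:\.\d+)*)\.\s+([A-Z].*)$")
PAGE_NUMBER_RE = re.compile(r"^\d{1,3}$")
CAPTION_RE = re.compile(r"^(Figure|Table)\s+\d+\s+\|\s+")


def classify(line: str) -> str:
    """Return the Markdown for one raw line; an empty string means 'drop it'."""
    s = line.strip()
    if PAGE_NUMBER_RE.match(s):
        return ""                      # standalone page number
    if CAPTION_RE.match(s):
        return f"*{s}*"                # figure/table caption, italicised
    m = NUMBERED_RE.match(s)
    if m:
        number, title = m.groups()
        depth = number.count(".") + 1  # "2.3.1" is depth 3
        label = f"{number}." if depth == 1 else number
        return f"{'#' * (depth + 1)} {label} {title}"
    return s                           # ordinary prose


print(classify("2.3.1. Compressed Sparse Attention"))
```

A bare-integer rule like this also drops numeric table cells that pdftotext places on their own line, so it is only safe outside table regions.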

For the problematic tables, I'm identifying them by their content boundaries—Table 3 ends before "model leverages" and Table 4 ends before "window of DeepSeek-V4"—and wrapping those ranges in code fences. Actually, I'm realizing the core issue is just escaping the special tokens like <think> and <|DSML|...> that appear in the schema, so I'll handle that with a global replacement approach instead.

Building the script now with a simpler architecture: read the raw text, skip everything before "1. Introduction," then process paragraph by paragraph. For each block, check if it's a heading pattern, a page number to skip, a figure or table caption, or regular prose that needs dehyphenation and HTML escaping.

Tables 3 and 4 get special handling by detecting their starting patterns and wrapping them in code fences. For URLs split across lines, I need to detect when a line contains http and the next line continues without a space, then join them without adding extra whitespace to preserve the URL structure.

For soft hyphens introduced by PDF wrapping, I'll use a simple heuristic: if a line ends with - and the next line starts with a lowercase letter, I'll remove the hyphen and join directly, assuming it's a word break rather than a genuine compound word.
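A minimal sketch of that dehyphenation heuristic. The function name `join_wrapped` is hypothetical. The branch that keeps the hyphen when the next line does not start with a lowercase letter is an added assumption, meant for names like "DeepSeek-V4-" / "Pro-Base" that wrap at a real hyphen.

```python
def join_wrapped(lines):
    """Join the physical lines of one paragraph into a single string."""
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if out.endswith("-") and line[0].islower():
            # Soft hyphen from PDF wrapping: drop it and join directly.
            out = out[:-1] + line
        elif out.endswith("-"):
            # Assumption: a real hyphen (name or range); keep it, no space.
            out += line
        elif out:
            out += " " + line
        else:
            out = line
    return out


print(join_wrapped(["The implemen-", "tation is simple."]))
```

The known cost of this heuristic is that a genuine compound wrapped at its hyphen with a lowercase continuation (for example "compression-" / "friendly") loses the hyphen.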

For URLs split across lines, I need to be more careful. Rather than trying to detect every possible URL pattern upfront, I'll write a targeted regex that handles the specific broken patterns I see in the references, then do a broader cleanup pass to catch any remaining splits where a line ends with / or contains http and the next line continues with what looks like a URL fragment.

>cd /Users/kiran/Downloads/pdf2kindle && rg -B0 -A1 -n "https?://" /tmp/dsv4_raw.txt | head -40

18:are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.
19-
--
856:1 https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/tree/main/inference
857-
--
1397:2 https://github.com/deepseek-ai/DeepGEMM/pull/304
1398-
--
1479:3 https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplicat
1480-ion/index.html#wave-quant
1481:4 https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-cuda-kernels
1482-.html#distributed-shared-memory
--
3476:AA. Gdpval-aa leaderboard, 2025. URL https://artificialanalysis.ai/methodolog
3477-y/intelligence-benchmarking#gdpval-aa.
--
3486:in Neural Information Processing Systems, 2025. URL https://neurips.cc/virtual/2
3487-025/poster/119124.
--
3491:J. Asher. LeanExplore: A search engine for Lean 4 declarations, 2025. URL https://arxiv.or
3492-g/abs/2506.11085.
--
3502:URL https://doi.org/10.48550/arXiv.2507.20534.
3503-Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, et al. Longbench
--
3520:reinforcement learning, 2017. URL https://openreview.net/forum?id=rJY3vK9eg.
3521-J. Chen, W. Chen, J. Du, J. Hu, Z. Jiang, A. Jie, X. Jin, X. Jin, C. Li, W. Shi, Z. Wang, M. Wang,
--
3524:experience, 2025. URL https://arxiv.org/abs/2512.17260.
3525-M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
--
3533:URL https://arxiv.org/abs/2107.03374.
3534-T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze,
--
3537:ISBN 978-1-939133-08-3. URL https://www.usenix.org/conference/osdi18/prese
3538-ntation/chen.
--
3545:07372. URL https://doi.org/10.48550/arXiv.2601.07372.
3546-K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek,

I'm noticing a pattern where URLs get split across line breaks with no hyphen or other marker: https://artificialanalysis.ai/methodolog on one line continues as y/intelligence-benchmarking#gdpval-aa. on the next. To catch this, I need to check whether a line ends inside a URL (its last whitespace-delimited token contains a URL scheme) and the next line starts with a non-space character, then join the two directly.

The stitching rule needs to distinguish a genuinely broken URL from a URL that is legitimately followed by new prose. If I first join lines with a space and then match with a regex, the match can swallow the start of the next sentence or reference entry as part of the URL. A next line that starts with a capital letter is probably a new sentence or reference entry, not a URL continuation, so I should only stitch when the next line looks like a URL fragment: a lowercase start and URL-safe characters.

Looking at the actual example, the URL https://artificialanalysis.ai/methodolog ends mid-word on one line, and the next line continues with y/intelligence-benchmarking#gdpval-aa. — so I should join without a space when the previous line's URL ends without a clear terminator and the next line starts with URL-class characters like slashes or dots.

The simplest approach: when the last word of a line contains http and the next line begins with lowercase letters or URL punctuation like /, ?, #, or ., treat it as a URL continuation and join without spacing.
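The rule above can be sketched as a small pass over the raw lines. `URL_TAIL_RE`, `CONT_START_RE`, and `stitch_urls` are hypothetical names, and the exact character classes are assumptions rather than the script's real patterns.

```python
import re

# The previous line must END inside a URL: its last token contains a scheme.
URL_TAIL_RE = re.compile(r"https?://\S*$")
# The next line must START like a URL fragment, not like a new sentence.
CONT_START_RE = re.compile(r"^[a-z0-9/?#.]")


def stitch_urls(lines):
    """Join a line onto the previous one, without a space, when it continues a URL."""
    out = []
    for line in lines:
        if out and URL_TAIL_RE.search(out[-1]) and CONT_START_RE.match(line):
            out[-1] += line
        else:
            out.append(line)
    return out


sample = [
    "AA. Gdpval-aa leaderboard, 2025. URL https://artificialanalysis.ai/methodolog",
    "y/intelligence-benchmarking#gdpval-aa.",
]
print(stitch_urls(sample))
```

Two limits of this sketch: a complete URL at the end of a line followed by a line of lowercase prose would be joined by mistake, and a break inside the scheme itself (a line ending in "http" with "s://" on the next) is not detected.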

>cd /Users/kiran/Downloads/pdf2kindle && awk 'NR==FNR && /https?:\/\// {url[NR]=1; next} url[FNR-1]' /tmp/dsv4_raw.txt /tmp/dsv4_raw.txt | head -30

ion/index.html#wave-quant
.html#distributed-shared-memory
y/intelligence-benchmarking#gdpval-aa.
025/poster/119124.
g/abs/2506.11085.
Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, et al. Longbench
J. Chen, W. Chen, J. Du, J. Hu, Z. Jiang, A. Jie, X. Jin, X. Jin, C. Li, W. Shi, Z. Wang, M. Wang,
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze,
ntation/chen.
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek,
46
L. De Moura and N. Bjørner. Z3: an efficient smt solver. In Proceedings of the Theory
931.
org/10.48550/arXiv.2412.19437.
434.
DeepSeek-AI. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nat.,
DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models, 2025. URL
X. Deng, J. Da, E. Pan, Y. Y. He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, K. Sampath,
H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, and S. Soatto. Fewer truncations
X. Du, Y. Yao, K. Ma, B. Wang, T. Zheng, K. Zhu, M. Liu, Y. Liang, X. Jin, Z. Wei, et al. Supergpqa:
X. Gao, M. Dong, X. Miao, W. Du, C. Yu, and H. Chen. Erofs: a compression-friendly readonly
g/10.48550/arXiv.2406.04127.
L. Haas, G. Yona, G. D’Antonio, S. Goldshtein, and D. Das. Simpleqa verified: A reliable
B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko.
H. Li, Y. Yuan, R. Du, K. Ma, L. Liu, and W. Hsu. DADI: Block-Level image service for agile
48

Reading the continuation fragments against their preceding lines, the split links include NVIDIA's matrix multiplication performance guide, a CUDA programming guide section on writing kernels, and the Artificial Analysis intelligence-benchmarking methodology page.

The pattern I'm noticing is that a URL continuation on the next line starts with a lowercase letter, a digit, or URL punctuation such as "." rather than a capital letter like a new sentence or reference entry would. So my heuristic is straightforward: if the previous line ends inside a URL (its last token contains http:// or https://) and the next line begins with a lowercase letter, a digit, or URL punctuation, I should join them without a space.

Now I have enough to design a robust custom builder that applies this logic. Let me create a dedicated script for this paper.

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "Total \(总计\)|Subtotal" /tmp/dsv4_raw.txt | head -10

3950:Total (总计)
4008:Subtotal (小计)
4135:Subtotal (小计)
4241:Subtotal (小计)
4311:Subtotal (小计)
4380:Subtotal (小计)
4441:Subtotal (小计)
4500:Subtotal (小计)
4557:Subtotal (小计)
4596:Total (总计)

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '3895,4020p' /tmp/dsv4_raw.txt

B. Evaluation Details
Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro.
Difficulty Category

#

Agent Win RAG Win Tie Agent% RAG% Tie%

Easy

Objective Q&A (客观问答) 196
Subjective Q&A (主观问答) 321

110
198

43
56

43
67

56.1
61.7

21.9
17.4

21.9
20.9

Hard

Objective Q&A (客观问答) 168
Subjective Q&A (主观问答) 184

102
126

33
27

33
31

60.7
68.5

19.6
14.7

19.6
16.8

Total (总计)

536

159

174

61.7

18.3

20.0

869

Table 10 | Cost Comparison:Agentic Search vs. Retrieval Augmented Search (Mean) for
DeepSeek-V4-Pro. Most of the tool calls are parallel for Agentic Search.
Version

Tool Calls

Prefill (tokens)

Output (tokens)

16.2
—

13649
10453

1526
1308

V4 Agentic Search
V4 Retrieval Augmented Search

Table 11 | Comparative Evaluation of DeepSeek-V4-Pro and DeepSeek-V3.2 on Search Q&A
Tasks.
Internal Evaluation (内部综合评估)
Category

Subcategory

# V4 win V3.2 win

tie V4% V3.2% tie%

95
Objective Single-value Search (单值信息查找)
Entity
Search
(实体信息查找)
99
Q&A
Enumerative Search (枚举型信息查找) 95
(客观问答)
Subtotal (小计)
289

36
24
19

10
7
8

49
68

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '4595,4905p' /tmp/dsv4_raw.txt | head -100 && echo "---" && wc -l /tmp/dsv4_raw.txt

Total (总计)

57

Table 13 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Creative
Writing.

Subcategory (文体)

#

Instruction Following(指令遵循)

Writing Quality (写作质量)

DS Gem Tie DS% Gem% Tie%

DS Gem Tie DS% Gem% Tie%

504
368
253
111
115
83
74
81
24
23
8
20
15
11
5
2
5
1

Fiction (小说故事)
General Fiction (泛小说故事)
Fan Fiction (同人文)
General Fan Fic. (泛同人文)
Narrative (记叙文)
General Prose (泛散文)
Prose (散文)
Writing Style (文笔)
Classical Poetry (古诗文)
Modern Poetry (现代诗)
Lyrics (歌词)
Literary Appreciation (赏析)
General Argument. (泛议论文)
General Narrative (泛记叙文)
General Classical (泛古文诗歌)
Creative Writing (创意写作)
Argumentative (议论文)
General Mod. Poetry (泛现代诗)

836
662
410
202
171
124
112
112
48
43
30
27
24
23
9
6
5
2

323
290
150
90
54
40
38
31
24
20
22
7
9
12
4
4
0
1

Total (总计)

2837 1703 1119

---
    4905 /tmp/dsv4_raw.txt

>cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout -f 56 -l 58 /Users/kiran/Downloads/DeepSeek_V4.pdf - | head -120

Figure 14 | Example output of a task that requires comparing two regular investment strategies
for the NASDAQ.




Figure 15 | Example output of a task which requires researching 2020-2025 Nobel Science Prizes
and generating an analytical PDF report.




                                             56
Table 12 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Functional
Writing.

                                                       Internal Evaluation (内部综合评估)
Category     Subcategory                         # DS win Gem win Tie      DS% Gem% Tie%
         Report (报告)                         527      350       162   15   66.41   30.74   2.85
         Proposal (方案策划)                     291      181       103    7   62.20   35.40   2.41
         Education (教育培训)                    159      100        56    3   62.89   35.22   1.89
Business Email & Letter (邮件书信)               146      107        37    2   73.29   25.34   1.37
Writing  Notice (通知公告)                        72       43        24    5   59.72   33.33   6.94
(办公文本) Professional (专业文本)                    63       34        27    2   53.97   42.86   3.17
         Recruitment (招聘求职)                   42       27        15    0   64.29   35.71   0.00
         Technical (技术文本)                     29       22         7    0   75.86   24.14   0.00
         Review (介绍评价)                        20       15         5    0   75.00   25.00   0.00
             Subtotal (小计)                  1349      879       436   34   65.16   32.32   2.52
        Social Media (社交媒体文案)                267      156       101   10   58.43   37.83 3.75
        Ad Copy (广告商品文案)                     214      109        98    7   50.93   45.79 3.27
        Long-form Content (内容平台长文)            99       71        25    3   71.72   25.25 3.03
Media   News Report (新闻报道)                    51       27        22    2   52.94   43.14 3.92
Writing Advertorial (营销软文)                    17       12         4    1   70.59   23.53 5.88
(媒体文本) Headline (标题)                          11        7         4    0   63.64   36.36 0.00
        Narration Script (口播文案)                4        2         1    1   50.00   25.00 25.00
        Comment (评论)                           3        2         1    0   66.67   33.33 0.00
             Subtotal (小计)                   666      386       256   24   57.96   38.44   3.60
         Congratulatory (祝贺文本)               101       54        41   6    53.47   40.59   5.94
Everyday Communication   (沟通回复)              100       71        26   3    71.00   26.00   3.00
Writing  Reflection (心得感想)                    90       68        17   5    75.56   18.89   5.56
         Review (介绍评价)                        55       44         9   2    80.00   16.36   3.64
(生活文本)
         Comment (评论)                         44       34         8   2    77.27   18.18   4.55
             Subtotal (小计)                   390      271       101   18   69.49   25.90   4.62
        Speech (发言稿)                         226      135        85   6 59.73      37.61   2.65
        Narration Script (口播文案)               51       25        23   3 49.02      45.10   5.88
Oral
Writing Sales Script (话术)                     31       22         6   3 70.97      19.35   9.68
        Dialogue (对话文本)                       10        4         6   0 40.00      60.00   0.00
(口头文本)
        Congratulatory (祝贺文本)                  1        1         0   0 100.00      0.00   0.00
             Subtotal (小计)                   319      187       120   12   58.62   37.62   3.76
         Administrative Doc (事务文书)           117       60        53   4    51.28   45.30 3.42
         Personal Doc (个人文书)                  73       45        27   1    61.64   36.99 1.37
Official
Document Government Doc (行政公文)                34       19        14   1    55.88   41.18 2.94
(公文文本) Speech (发言稿)                            3        1         2   0    33.33   66.67 0.00
         Essay Writing (申论写作)                  3        1         1   1    33.33   33.33 33.33
             Subtotal (小计)                   230      126        97    7   54.78   42.17   3.04
         Research Paper (学术论文)               104       67        32   5    64.42   30.77   4.81
Academic Coursework (课程作业)                    90       53        35   2    58.89   38.89   2.22
Writing  Academic Support (学术辅助)              15       11         3   1    73.33   20.00   6.67
(学术文本) Science Outreach (专业科普)                 7        6         1   0    85.71   14.29   0.00
             Subtotal (小计)                   216      137        71    8   63.43   32.87   3.70
Total (总计)                                  3170     1986      1081 103    62.65   34.10   3.25



                                            57
Table 13 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Creative
Writing.

                                    Instruction Following(指令遵循)                       Writing Quality (写作质量)
Subcategory (文体)               #   DS Gem Tie DS% Gem% Tie%                     DS Gem Tie DS% Gem% Tie%
Fiction (小说故事)               836   504   323    5 60.58        38.82    0.60    672    157    3 80.77     18.87   0.36
General Fiction (泛小说故事)      662   368   290    3 55.67        43.87    0.45    467    194    0 70.65     29.35   0.00
Fan Fiction (同人文)            410   253   150    3 62.32        36.95    0.74    338     67    1 83.25     16.50   0.25
General Fan Fic. (泛同人文)      202   111    90    1 54.95        44.55    0.50    161     40    1 79.70     19.80   0.50
Narrative (记叙文)              171   115    54    2 67.25        31.58    1.17    141     30    0 82.46     17.54   0.00
General Prose (泛散文)          124    83    40    1 66.94        32.26    0.81     88     36    0 70.97     29.03   0.00
Prose (散文)                   112    74    38    0 66.07        33.93    0.00     92     20    0 82.14     17.86   0.00
Writing Style (文笔)           112    81    31    0 72.32        27.68    0.00     86     26    0 76.79     23.21   0.00
Classical Poetry (古诗文)        48    24    24    0 50.00        50.00    0.00     39      9    0 81.25     18.75   0.00
Modern Poetry (现代诗)           43    23    20    0 53.49        46.51    0.00     32     11    0 74.42     25.58   0.00
Lyrics (歌词)                   30     8    22    0 26.67        73.33    0.00     16     14    0 53.33     46.67   0.00
Literary Appreciation (赏析)    27    20     7    0 74.07        25.93    0.00     18      9    0 66.67     33.33   0.00
General Argument. (泛议论文)      24    15     9    0 62.50        37.50    0.00     17      7    0 70.83     29.17   0.00
General Narrative (泛记叙文)      23    11    12    0 47.83        52.17    0.00     15      8    0 65.22     34.78   0.00
General Classical (泛古文诗歌)      9     5     4    0 55.56        44.44    0.00      5      4    0 55.56     44.44   0.00
Creative Writing (创意写作)        6     2     4    0 33.33        66.67    0.00      4      2    0 66.67     33.33   0.00
Argumentative (议论文)            5     5     0    0 100.00        0.00    0.00      5      0    0 100.00     0.00   0.00
General Mod. Poetry (泛现代诗)     2     1     1    0 50.00        50.00    0.00      2      0    0 100.00     0.00   0.00
Total (总计)                   2837 1703 1119    15    60.03     39.44    0.53 2198      634    5   77.48   22.35   0.18




Table 14 | DeepSeek-V4-Pro vs. Claude-Opus-4.5 on Complex Instruction Following and Multi-
Turn Writing.

                                                             Internal Evaluation (内部综合评估)
      Category                                        # DS Opus Tie              DS% Opus% Tie%
      Complex Inst. Following (复杂指令跟随) 49                 23       26          0 46.9%       53.1%   0.0%
      Multi-Turn Writing (多轮写作)        147                67       76          4 45.6%       51.7%   2.7%
      Total (总计)                                    196   90      102          4 45.9%       52.0% 2.0%




                                                    58

>cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -10

main()
    ~~~~^^
  File "/Users/kiran/Downloads/pdf2kindle/build_deepseek_v4.py", line 354, in main
    body_md, appendix_idx, eval_idx, paragraphs = build_body(raw_lines)
                                                  ~~~~~~~~~~^^^^^^^^^^^
  File "/Users/kiran/Downloads/pdf2kindle/build_deepseek_v4.py", line 263, in build_body
    appendix_idx = find_appendix_start(paragraphs)
  File "/Users/kiran/Downloads/pdf2kindle/build_deepseek_v4.py", line 221, in find_appendix_start
    raise ValueError("could not locate appendix start")
ValueError: could not locate appendix start

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '3835,3845p' /tmp/dsv4_raw.txt

Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL http
s://openreview.net/forum?id=YrycTjllL0.

53

Appendix
A. Author List and Acknowledgment
A.1. Author List
Authors are listed alphabetically by their first name. Names marked with * denote individuals
who have departed from our team.
Research & Engineering: Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang*, Bingzheng Xu,

The paragraph splitter relies on blank lines, but in the raw text "Appendix", "A. Author List and Acknowledgment", "A.1. Author List", and the body text that follows sit on consecutive lines with no blank line between them. They get merged into a single paragraph, so the appendix heading is never seen on its own and find_appendix_start fails.

The cleanest approach is to iterate through individual lines rather than pre-split paragraphs, treating heading lines themselves as natural paragraph boundaries so I can properly separate the appendix sections.
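A minimal sketch of that line-based grouping, assuming the heading test is available as a predicate. `split_blocks` and `is_heading` are hypothetical names and may not match the edited script.

```python
def split_blocks(lines, is_heading):
    """Group raw lines into blocks; blank lines AND heading lines end a block."""
    blocks, current = [], []

    def flush():
        if current:
            blocks.append(" ".join(current))
            current.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            flush()                 # blank line: paragraph boundary
        elif is_heading(line):
            flush()                 # a heading closes the open paragraph...
            blocks.append(line)     # ...and always stands alone
        else:
            current.append(line)
    flush()
    return blocks


headings = {"Appendix", "A. Author List and Acknowledgment", "A.1. Author List"}
raw = [
    "53", "", "Appendix", "A. Author List and Acknowledgment", "A.1. Author List",
    "Authors are listed alphabetically by their first name.",
    "Names marked with * denote individuals who have departed.",
]
for block in split_blocks(raw, headings.__contains__):
    print(repr(block))
```

With this shape, a heading is detected even when the PDF text puts no blank line before or after it.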

Edited build_deepseek_v4.py

>cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -10

Wrote deepseek-v4.md (1828 lines)

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^#" deepseek-v4.md | head -80

7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
12:## 1. Introduction
39:## 2. Architecture
48:### 2.1 Designs Inherited from DeepSeek-V3
54:### 2.2 Manifold-Constrained Hyper-Connections
237:### 2.3 Hybrid Attention with CSA and HCA
243:#### 2.3.1 Compressed Sparse Attention
435:#### 2.3.2 Heavily Compressed Attention
504:#### 2.3.3 Other Details
513:#### 2.3.4 Efficiency Discussion
522:### 2.4 Muon Optimizer
534:## 3. General Infrastructures
537:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
621:### 3.2 Flexible and Efficient Kernel Development with TileLang
633:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
642:### 3.4 FP4 Quantization-Aware Training
651:### 3.5 Training Framework
657:#### 3.5.1 Efficient Implementation of Muon
663:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
669:#### 3.5.3 Contextual Parallelism for Long-Context Attention
675:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
684:### 3.6 Inference Framework
690:#### 3.6.1 KV Cache Structure and Management
792:#### 3.6.2 On-Disk KV Cache Storage
801:## 4. Pre-Training
804:### 4.1 Data Construction
813:### 4.2 Pre-Training Setups
816:#### 4.2.1 Model Setups
822:#### 4.2.2 Training Setups
831:#### 4.2.3 Mitigating Training Instability
840:### 4.3 Evaluations
843:#### 4.3.1 Evaluation Benchmarks
852:# Shots
894:#### 4.3.2 Evaluation Results
903:## 5. Post-Training
906:### 5.1 Post-Training Pipeline
912:#### 5.1.1 Specialist Training
954:## 2. <think> thinking
969:## Tools You have access to a set of tools to help answer the user’s question. You can invoke tools by writing a "`<|DSML|tool_calls>`" block like the following: `<|DSML|tool_calls>` `<|DSML|invoke name="$TOOL_NAME">` `<|DSML|parameter name="$PARAMETER_NAME" string="true|false">`$PARAMETER_VALUE `</|DSML|parameter>` ... `</|DSML|invoke>` `<|DSML|invoke name="$TOOL_NAME2">` ... `</|DSML|invoke>` `</|DSML|tool_calls>` String parameters should be specified as is and set ‘string="true"‘. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set ‘string="false"‘. If thinking_mode is enabled (triggered by `<think>`), you MUST output your complete reasoning inside `<think>`...`</think>` BEFORE any tool calls or final response. Otherwise, output directly after `</think>` with tool calls or final response. ### Available Tool Schemas {Tool Definition...} You MUST strictly follow the above definedtool name and parameter schemas to invoke tool calls.
984:#### 5.1.2 On-Policy Distillation
1062:### 5.2 RL and OPD Infrastructures
1068:#### 5.2.1 FP4 Quantization Integration
1074:#### 5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
1080:#### 5.2.3 Preemptible and Fault-Tolerant Rollout Service
1089:#### 5.2.4 Scaling RL Framework for Million-Token Context
1095:#### 5.2.5 Sandbox Infrastructure for Agentic AI
1104:### 5.3 Standard Benchmark Evaluation
1107:#### 5.3.1 Evaluation Setup
1119:#### 5.3.2 Evaluation Results
1347:### 5.4 Performance on Real-World Tasks
1353:#### 5.4.1 Chinese Writing
1359:#### 5.4.2 Search
1365:#### 5.4.3 White-Collar Task
1509:#### 5.4.4 Code Agent
1536:## 6. Conclusion, Limitations, and Future Directions
1548:## A. Agache, M. Brooker, A. Florescu, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and
1557:## B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet,
1563:## A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
1566:## A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage,
1572:## A. Cheng, A. Jacovi, A. Globerson, B. Golan, C. Kwong, C. Alberti, C. Tao, E. Ben-David, G. S.
1581:## A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao,
1587:## B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko.
1608:## B. Shahriari, A. Ram’e, J. Ferret, P. Liu, P. D. Tafti, A. Friesen, M. Casbon, S. Ramos, R. Kumar,
1614:## A. Hutchison, A. Abdagic, A. Carl, A. Shen, A. Brock, A. Coenen, A. Laforge, A. Paterson,
1617:## B. Bastian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar, C. Perry, C. A. Welty, C. A. ChoquetteChoo, D. Sinopalnikov, D. Weinberger, D. Vijaykumar, D. Rogozi’nska, D. Herbison, E. Bandy,
1626:## B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea,
1635:## A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30,
1641:## B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou,
1654:## A. Author List and Acknowledgment
1657:### A.1 Author List
1666:### A.2 Acknowledgment
1672:## B. Evaluation Details

I'm spotting a few parsing errors in the output. There's a bogus heading at line 852, a leftover instruction fragment at line 954 where the "2." prefix got misinterpreted as a chapter marker, and another false match at line 969 where the Table 4 content triggered the heading regex because it started with ## Tools. I need to refine the pattern matching to avoid these false positives.

The core issue is that the original PDF had literal hash marks in the tool-call schema, which my code didn't escape. I should wrap those tables in code blocks so they can't be parsed as headings. In the references section the problem is different: author names like "A. Agache" and "B. Jacob" match the appendix heading pattern, which accepts any capital letter followed by a period. Reference lines contain commas in their author lists and real appendix headings don't, so that could tell them apart, but the more reliable fix is to hardcode the known appendix section titles instead of relying on a regex that keeps matching reference lines.
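A sketch of the hardcoded lookup, using the four appendix titles that appear in the raw text and the Markdown forms shown in the heading listing. The names `APPENDIX_HEADINGS` and `appendix_heading` are hypothetical.

```python
# Exact raw-text titles mapped to the Markdown heading to emit.
APPENDIX_HEADINGS = {
    "A. Author List and Acknowledgment": "## A. Author List and Acknowledgment",
    "A.1. Author List": "### A.1 Author List",
    "A.2. Acknowledgment": "### A.2 Acknowledgment",
    "B. Evaluation Details": "## B. Evaluation Details",
}


def appendix_heading(line: str):
    """Return the Markdown heading for a known appendix title, else None."""
    return APPENDIX_HEADINGS.get(line.strip())


print(appendix_heading("B. Evaluation Details"))
print(appendix_heading("B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard,"))
```

An exact-match table cannot produce false positives on reference entries, but it has to be updated by hand for any other paper.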

The bogus heading at line 852 has a different cause. "# Shots" is a table column header that pdftotext extracted onto its own line. My heading regex doesn't match it, so it passes through as ordinary text, but because it starts with #, Markdown renders it as a heading.

I need to escape leading # characters in paragraphs that aren't actual headings.
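A minimal sketch of that escape, assuming it is applied only to blocks already classified as non-headings. The name `escape_leading_hash` is hypothetical. It relies on Markdown's backslash escape, so `\#` is read as a literal hash.

```python
import re


def escape_leading_hash(paragraph: str) -> str:
    """Backslash-escape a '#' that opens the paragraph so it is not a heading."""
    # Only the first '#' needs escaping: once it is literal, the line
    # no longer starts an ATX heading.
    return re.sub(r"^(\s*)#", r"\1\\#", paragraph)


print(escape_leading_hash("# Shots"))
print(escape_leading_hash("## Tools You have access to a set of tools"))
```

A hash mark elsewhere in the paragraph is left alone, because only a leading one can start a heading.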

The real problem is that Tables 1 through 8 are getting mangled during extraction — the dense numeric data and column alignment are collapsing into fragmented paragraphs. I should handle these tables the same way I did with Tables 9-14 by wrapping them in code blocks, either by using pdftotext -layout mode or by identifying and isolating the messy table regions directly.
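For the wrapping step, a small helper along these lines may be enough. It assumes the table text has already been isolated as one string, for example from pdftotext -layout output. The name `fence` is hypothetical.

```python
import re


def fence(block: str) -> str:
    """Wrap a layout-preserved table in a code fence, keeping leading spaces."""
    # Use a fence longer than any backtick run inside the block,
    # so the block cannot close the fence early.
    longest = max((len(run) for run in re.findall(r"`+", block)), default=0)
    ticks = "`" * max(3, longest + 1)
    return f"{ticks}\n{block.rstrip()}\n{ticks}"


table = (
    "  Benchmark (Metric)   # Shots   Base\n"
    "  MMLU (EM)            5-shot    87.8\n"
)
print(fence(table))
```

Inside a fence the "# Shots" header and any `$` or tag-like tokens are treated as literal text, so they need no separate escaping there.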

>cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout -f 28 -l 29 /Users/kiran/Downloads/DeepSeek_V4.pdf - | head -60

Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-
Pro-Base. All models are evaluated in our internal framework and share the same evaluation
setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. The highest
score in each row is in bold font, and the second is underlined.

                                                 DeepSeek-V3.2 DeepSeek-V4-Flash DeepSeek-V4-Pro
               Benchmark (Metric)      # Shots
                                                     Base            Base             Base
               Architecture               -             MoE          MoE              MoE
               # Activated Params         -              37B          13B             49B
               # Total Params             -             671B         284B             1.6T
              AGIEval (EM)             0-shot           80.1         82.6              83.1
              MMLU (EM)                5-shot           87.8         88.7              90.1
              MMLU-Redux (EM)          5-shot           87.5         89.4              90.8
              MMLU-Pro (EM)            5-shot           65.5         68.3              73.5
              MMMLU (EM)               5-shot           87.9         88.8              90.3
              C-Eval (EM)              5-shot           90.4         92.1              93.1
 World Knowl.
              CMMLU (EM)               5-shot           88.9         90.4              90.8
              MultiLoKo (EM)           5-shot           38.7         42.2              51.1
              Simple-QA verified (EM) 25-shot           28.3         30.1              55.2
              SuperGPQA (EM)           5-shot           45.0         46.5              53.9
              FACTS Parametric (EM) 25-shot             27.1         33.9              62.6
              TriviaQA (EM)            5-shot           83.3         82.8              85.6
               BBH (EM)                3-shot           87.6         86.9              87.5
               DROP (F1)               1-shot           88.2         88.6              88.7
 Lang. & Reas. HellaSwag (EM)          0-shot           86.4         85.7              88.0
               WinoGrande (EM)         0-shot           78.9         79.5              81.5
               CLUEWSC (EM)            5-shot           83.5         82.2              85.2
               BigCodeBench (Pass@1)   3-shot           63.9         56.8              59.2
               HumanEval (Pass@1)      0-shot           62.8         69.5              76.8
               GSM8K (EM)              8-shot           91.1         90.8              92.6
 Code & Math
               MATH (EM)               4-shot           60.5         57.4              64.5
               MGSM (EM)               8-shot           81.3         85.7              84.4
               CMath (EM)              3-shot           92.6         93.6              90.9
 Long Context LongBench-V2 (EM)        1-shot           40.2         44.7              51.5



4.3.2. Evaluation Results

In Table 1, we provide a detailed comparison of the base models for DeepSeek-V3.2, DeepSeek-
V4-Flash, and DeepSeek-V4-Pro, all evaluated under a unified internal framework with strictly
consistent settings.
    Comparing DeepSeek-V4-Flash-Base with DeepSeek-V3.2-Base reveals a compelling ef-
ficiency story. Despite utilizing a substantially smaller number of both activated and total
parameters, DeepSeek-V4-Flash-Base outperforms DeepSeek-V3.2-Base across a wide array of
benchmarks. This advantage is especially evident in world knowledge tasks and challenging
long-context scenarios. These results underscore that architectural improvements, refined data
quality, and training optimizations in DeepSeek-V4-Flash-Base yield superior performance even
with a more compact parameter budget, effectively surpassing the larger DeepSeek-V3.2-Base
on the majority of evaluations.
    Furthermore, DeepSeek-V4-Pro-Base demonstrates a further, decisive leap in capability,
establishing near-universal dominance over both DeepSeek-V3.2-Base and DeepSeek-V4-Flash-
Base. With improvements across almost all categories, DeepSeek-V4-Pro-Base reaches new


                                                   28
performance highs among DeepSeek base models on the most demanding benchmarks. On

Wrapping tables in layout-mode preformatted blocks makes them much more readable. My approach is to identify each table region in the source by finding the "Table N |" headers, then extract just those page ranges using pdftotext with layout mode to preserve the formatting. I'll combine this with the raw reflowed text I'm already generating to get the best of both.

Mapping out the table and figure locations across the document: Tables 1 through 8 are scattered through the main sections, with 9-14 in the appendix. For figures, I just need to keep the captions and drop the fragmented chart data that follows.

My best approach is to make the paragraph processor stateful: whenever it hits a Figure or Table caption, it enters a skip mode that drops noise paragraphs, then resumes normal processing once it reaches actual prose again. This avoids the post-processing complexity and handles the cleanup in a single pass. For tables specifically, I'll insert a layout-mode code block right after the caption, then drop the fragmented content that follows.

To make this work, I need to precompute page numbers for each table by scanning the layout text. I'm also considering two simpler options: drop the table fragments entirely and rely on captions, or preserve the most important tables via layout fallback in an appendix-like section at the end.

Actually, I'm realizing that putting tables at the end breaks reading flow—the reader sees "as shown in Table 6" but has to flip to find it. Better to insert the layout-extracted table inline right after the caption for referenced tables, while dropping figure noise entirely. So my plan is to pre-scan the PDF layout to find each table location and capture its full content, then in the main processor, when I encounter a table caption, I'll replace the following noise with the extracted table in a code block. Now I need to check what the actual table formatting looks like in the layout text.
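The pre-scan could look roughly like this. It is a sketch that assumes the layout text still contains the form feed characters pdftotext writes at page breaks; `find_captions` and `CAPTION_RE` are hypothetical names, and the page numbers are relative to the first extracted page.

```python
import re

# A caption line starts with "Table N |" or "Figure N |" once indentation is stripped.
CAPTION_RE = re.compile(r"^(Table|Figure) (\d+) \|")


def find_captions(layout_text: str) -> list[tuple[str, int, int]]:
    """Return (kind, number, page) for every caption line in the layout text."""
    found = []
    # pdftotext separates pages with a form feed unless told otherwise.
    for page_no, page in enumerate(layout_text.split("\f"), start=1):
        for line in page.splitlines():
            match = CAPTION_RE.match(line.strip())
            if match:
                found.append((match.group(1), int(match.group(2)), page_no))
    return found


sample = (
    "Table 2 | Comparison of three reasoning modes\n"
    "     Reasoning         Characteristics\n"
    "\f"
    "Table 4 | Tool-call schema for DeepSeek-V4 series.\n"
    "based format for tool invocations, as demonstrated in Table 4. Our\n"
    "                   Figure 7 | Thinking management of DeepSeek-V4 series.\n"
)
print(find_captions(sample))
```

Requiring the ` |` separator keeps in-text mentions such as "as demonstrated in Table 4" from being mistaken for captions.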

>cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout -f 30 -l 32 /Users/kiran/Downloads/DeepSeek_V4.pdf - | head -120

Table 2 | Comparison of three reasoning modes

     Reasoning         Characteristics         Typical Use Cases        Response Format
     Mode
     Non-think         Fast, intuitive re- Routine daily tasks, </think> summary
                       sponses based on emergency reactions,
                       habits or simple low-risk decisions.
                       rules.
     Think High        Conscious logical       Complex problem- <think> thinking
                       analysis, slower but    solving, planning, tokens </think>
                       more accurate.          medium-risk deci- summary
                                               sions.
     Think Max         Push reasoning to its   Exploring the bound- 1. A special system
                       fullest extent. Slow    ary of model reason- prompt at the begin-
                       but powerful.           ing capability.      ning.
                                                                    2. <think> thinking
                                                                    tokens   </think>
                                                                    summary

Table 3 | Instruction injected into the system prompt for the "Think Max" mode.


   Injected Instruction

   Reasoning Effort: Absolute maximum with no shortcuts permitted.
   You MUST be very thorough in your thinking and comprehensively decompose the
   problem to resolve the root cause, rigorously stress-testing your logic against all potential
   paths, edge cases, and adversarial scenarios.
   Explicitly write out your entire deliberation process, documenting every intermediate
   step, considered alternative, and rejected hypothesis to ensure absolutely no assumption
   is left unchecked.



model leverages its own logic to generalize across complex tasks.


Tool-Call Schema and Special Token. Consistent with our previous version, we utilize a
dedicated <think></think> tag to delineate the reasoning path. In DeepSeek-V4 series, we
introduce a new tool-call schema that employs a special "|DSML|" token and utilizes an XML-
based format for tool invocations, as demonstrated in Table 4. Our experiments demonstrate that
the XML format effectively mitigates escaping failures and reduces tool-call errors, providing a
more robust interface for model-tool interactions.


Interleaved Thinking. DeepSeek-V3.2 introduced a context management strategy that retains
reasoning traces across tool-result rounds but discards them upon the arrival of new user mes-
sages. While effective, this still caused unnecessary token waste in complex agentic workflows
— each new user turn would flush all accumulated reasoning content, forcing the model to
reconstruct its problem-solving state from scratch. Leveraging the expanded 1M-token context


                                                30
Table 4 | Tool-call schema for DeepSeek-V4 series.

   Tool Call Schema

       ## Tools

       You have access to a set of tools to help answer the user’s question. You can
            invoke tools by writing a "<|DSML|tool_calls>" block like the following:

       <|DSML|tool_calls>
       <|DSML|invoke name="$TOOL_NAME">
       <|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE
           </|DSML|parameter>
       ...
       </|DSML|invoke>
       <|DSML|invoke name="$TOOL_NAME2">
       ...
       </|DSML|invoke>
       </|DSML|tool_calls>

       String parameters should be specified as is and set ‘string="true"‘. For all
           other types (numbers, booleans, arrays, objects), pass the value in JSON
           format and set ‘string="false"‘.

       If thinking_mode is enabled (triggered by <think>), you MUST output your
           complete reasoning inside <think>...</think> BEFORE any tool calls or
           final response.

       Otherwise, output directly after </think> with tool calls or final response.

       ### Available Tool Schemas

       {Tool Definition...}

       You MUST strictly follow the above definedtool name and parameter schemas to
           invoke tool calls.




window of DeepSeek-V4 series, we further refine this mechanism to maximize the effectiveness
of interleaved thinking in agentic environments:

   • Tool-Calling Scenarios. As illustrated in Figure 7(a), all reasoning content is fully pre-
     served throughout the entire conversation. Unlike DeepSeek-V3.2, which discarded
     thinking traces upon each new user turn, DeepSeek-V4 series retain the complete reason-
     ing history across all rounds, including across user message boundaries. This allows the
     model to maintain a coherent, cumulative chain of thought over long-horizon agent tasks.
   • General Conversational Scenarios. As illustrated in Figure 7(b), the original strategy is
     preserved: reasoning content from previous turns is discarded when a new user message
     arrives, keeping the context concise for settings where persistent reasoning traces provide
     limited benefit.

As with DeepSeek-V3.2, agent frameworks that simulate tool interactions via user messages (e.g.,
Terminus) may not trigger the tool-calling context path and thus may not benefit from enhanced
reasoning persistence. We continue to recommend non-think models for such architectures.

                                              31
                              a) Thinking with tools




                              b) Thinking without tools

                   Figure 7 | Thinking management of DeepSeek-V4 series.

I'm reconsidering the approach—instead of trying to preserve layout mode across the whole body, I'll extract it for sections 1-6 as a single source, then identify tables as indented multi-line blocks and fence them separately. This should handle Tables 1-8 cleanly without the complications of mixed whitespace handling.

The hybrid strategy is cleaner: use raw extraction for prose (which reflows naturally), but pull tables specifically from layout mode where they stay aligned. I can detect tables by their captions like "Table N |..." and map them back to the layout extraction. Then when processing the raw paragraph stream, I'll swap in the fenced table content when I hit a table caption, skip any noise paragraphs after it, and do the same for figures without inserting anything.
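A small sketch of the "swap in the fenced table" step. `fence_table` is a hypothetical helper; dedenting the block is my own assumption (to save horizontal space on a narrow screen), not something the session specifies.

```python
import textwrap

FENCE = "`" * 3  # assembled at runtime so this sketch can itself sit inside a fence


def fence_table(caption: str, layout_lines: list[str]) -> str:
    """Emit the caption followed by the layout-mode rows as a fenced code block."""
    block = "\n".join(line.rstrip() for line in layout_lines).strip("\n")
    block = textwrap.dedent(block)  # drop the indentation shared by all rows
    return f"{caption}\n\n{FENCE}\n{block}\n{FENCE}"


rows = [
    "",
    "     Reasoning         Characteristics",
    "     Mode",
    "     Non-think         Fast, intuitive re-",
    "",
]
print(fence_table("Table 2 | Comparison of three reasoning modes", rows))
```

Because everything between the fences is literal, lines such as "## Tools" inside Table 4 would no longer be read as headings.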

For the references section, I need to switch modes once we hit "References"—stop treating heading-like patterns as paragraph breaks since author names will match those patterns. I'll detect the references region by checking if we've encountered a "References" paragraph, then disable the heading-split logic for everything after that point.
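As a sketch, the mode switch is a single flag carried through the paragraph loop. `HEADING_LIKE_RE` and `classify` are hypothetical stand-ins for the script's own pattern and processor; real appendix titles that follow the references would still need their own rule.

```python
import re

# Hypothetical heading-like pattern: "1. Title", "2.3.1 Title", or "A. Title".
HEADING_LIKE_RE = re.compile(r"^(?:\d+(?:\.\d+)*\.?|[A-Z]\.)\s+\S")


def classify(paragraphs: list[str]) -> list[tuple[str, str]]:
    """Tag each paragraph as 'heading' or 'para', ignoring heading-like lines
    once the References section has started."""
    in_references = False
    result = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if text == "References":
            in_references = True
            result.append(("heading", text))
        elif not in_references and HEADING_LIKE_RE.match(text):
            result.append(("heading", text))
        else:
            result.append(("para", text))
    return result


for kind, text in classify(["1. Introduction", "References", "A. Agache", "B. Jacob"]):
    print(kind, "|", text)
```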

Designing the processor...

I need to handle the References section properly by detecting it as a heading, then restrict appendix matching to known titles like "A. Author List and Acknowledgment" and "B. Evaluation Details" so they're recognized as proper headings rather than paragraph content. I'll also drop the standalone "Appendix" line since the next heading already marks the appendix start, and escape any stray # characters that appear in table fragments.

Let me significantly improve the build script with state-tracking and table handling.

>cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -10 && rg -n "^#" deepseek-v4.md | head -100

Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
775:## Tools You have access to a set of tools to help answer the user’s question. You can invoke tools by writing a "`<|DSML|tool_calls>`" block like the following: `<|DSML|tool_calls>` `<|DSML|invoke name="$TOOL_NAME">` `<|DSML|parameter name="$PARAMETER_NAME" string="true|false">`$PARAMETER_VALUE `</|DSML|parameter>` ... `</|DSML|invoke>` `<|DSML|invoke name="$TOOL_NAME2">` ... `</|DSML|invoke>` `</|DSML|tool_calls>` String parameters should be specified as is and set ‘string="true"‘. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set ‘string="false"‘. If thinking_mode is enabled (triggered by `<think>`), you MUST output your complete reasoning inside `<think>`...`</think>` BEFORE any tool calls or final response. Otherwise, output directly after `</think>` with tool calls or final response. ### Available Tool Schemas {Tool Definition...} You MUST strictly follow the above definedtool name and parameter schemas to invoke tool calls.
1462:## References
1490:## A. Author List and Acknowledgment
1492:### A.1 Author List
1498:### A.2 Acknowledgment
1502:## B. Evaluation Details

I'm noticing that the section headings have disappeared from the output, though the ## Tools heading is still there. The problem seems to be that when I wrap Table 4 in a code block, the layout extraction is picking up lines starting with ## Tools from the schema content—but those should be protected inside the code block and not parsed as markdown. Let me trace through the markdown output to see what's happening around that section.

I'm noticing the ## Tools heading is appearing after the closing code fence when it shouldn't be there—it looks like my filtering logic isn't catching this stray paragraph, and it should either be swallowed entirely or wrapped into the table structure above it.

Looking at how the table extraction works in layout mode, I'm capturing Table 4 from its caption all the way until the next figure or table caption appears, but that's pulling in prose that comes after the table ends. The real issue is that I need a better way to detect where a table actually terminates—not just by looking for the next caption, but by spotting when a left-aligned paragraph starts at column 0, which signals the end of the indented table region. Some tables might end at page breaks instead, so I need to handle both cases.

Refining the detection logic: a table ends when I encounter a line at column 0 that isn't a caption keyword and looks like prose—something with enough words and sentence-ending punctuation. I'm updating the extraction function to scan for that first non-indented, prose-like line after the table content, which will properly separate Table 4 from the paragraph about the DeepSeek-V4 series that follows it.
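A rough sketch of that end-of-table test. The word threshold and the punctuation set are guesses, not values taken from the script, and the sketch assumes the caption fits on a single line.

```python
import re

CAPTION_RE = re.compile(r"^(?:Table|Figure) \d+ \|")


def looks_like_prose(line: str, min_words: int = 6) -> bool:
    """Heuristic: unindented, wordy, and punctuated like running text."""
    if not line.strip() or line[0].isspace():
        return False  # blank or indented: still inside the table region
    if CAPTION_RE.match(line):
        return False  # captions are handled by the caller
    has_punct = any(ch in line for ch in ".,;:")
    return len(line.split()) >= min_words and has_punct


def find_table_end(lines: list[str], caption_index: int) -> int:
    """Index of the first line after the caption that closes the table:
    either the next caption or a prose-like line at column 0."""
    for i in range(caption_index + 1, len(lines)):
        if CAPTION_RE.match(lines[i].strip()) or looks_like_prose(lines[i]):
            return i
    return len(lines)  # table runs to the end of the extracted text


table4 = [
    "Table 4 | Tool-call schema for DeepSeek-V4 series.",
    "",
    "   Tool Call Schema",
    "       ## Tools",
    "       You have access to a set of tools to help answer the user's question. You can",
    "",
    "window of DeepSeek-V4 series, we further refine this mechanism to maximize the effectiveness",
]
print(find_table_end(table4, 0))
```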

Edited build_deepseek_v4.py

>cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -5 && rg -n "^#" deepseek-v4.md | head -80

Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
619:## Tools You have access to a set of tools to help answer the user’s question. You can invoke tools by writing a "`<|DSML|tool_calls>`" block like the following: `<|DSML|tool_calls>` `<|DSML|invoke name="$TOOL_NAME">` `<|DSML|parameter name="$PARAMETER_NAME" string="true|false">`$PARAMETER_VALUE `</|DSML|parameter>` ... `</|DSML|invoke>` `<|DSML|invoke name="$TOOL_NAME2">` ... `</|DSML|invoke>` `</|DSML|tool_calls>` String parameters should be specified as is and set ‘string="true"‘. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set ‘string="false"‘. If thinking_mode is enabled (triggered by `<think>`), you MUST output your complete reasoning inside `<think>`...`</think>` BEFORE any tool calls or final response. Otherwise, output directly after `</think>` with tool calls or final response. ### Available Tool Schemas {Tool Definition...} You MUST strictly follow the above definedtool name and parameter schemas to invoke tool calls.
988:## References
1016:## A. Author List and Acknowledgment
1018:### A.1 Author List
1024:### A.2 Acknowledgment
1028:## B. Evaluation Details

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '492,500p' /tmp/dsv4_raw.txt

3

1. Introduction
The emergence of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024c) has established a
new paradigm of test-time scaling, driving substantial performance gains for Large Language
Models (LLMs). However, this scaling paradigm is fundamentally constrained by the quadratic
computational complexity of the vanilla attention mechanism (Vaswani et al., 2017), which
creates a prohibitive bottleneck for ultra-long contexts and reasoning processes. Concurrently,
the emergence of long-horizon scenarios and tasks — from complex agentic workflows to

Edited build_deepseek_v4.py

>cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -5 && rg -n "^#" deepseek-v4.md | head -80

Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
10:## 1. Introduction
34:## 2. Architecture
40:### 2.1 Designs Inherited from DeepSeek-V3
44:### 2.2 Manifold-Constrained Hyper-Connections
166:### 2.3 Hybrid Attention with CSA and HCA
170:#### 2.3.1 Compressed Sparse Attention
296:#### 2.3.2 Heavily Compressed Attention
342:#### 2.3.3 Other Details
348:#### 2.3.4 Efficiency Discussion
354:### 2.4 Muon Optimizer
362:## 3. General Infrastructures
364:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
420:### 3.2 Flexible and Efficient Kernel Development with TileLang
428:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
434:### 3.4 FP4 Quantization-Aware Training
440:### 3.5 Training Framework
444:#### 3.5.1 Efficient Implementation of Muon
448:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
452:#### 3.5.3 Contextual Parallelism for Long-Context Attention
456:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
462:### 3.6 Inference Framework
466:#### 3.6.1 KV Cache Structure and Management
538:#### 3.6.2 On-Disk KV Cache Storage
544:## 4. Pre-Training
546:### 4.1 Data Construction
552:### 4.2 Pre-Training Setups
554:#### 4.2.1 Model Setups
558:#### 4.2.2 Training Setups
564:#### 4.2.3 Mitigating Training Instability
570:### 4.3 Evaluations
572:#### 4.3.1 Evaluation Benchmarks
628:#### 4.3.2 Evaluation Results
634:## 5. Post-Training
636:### 5.1 Post-Training Pipeline
640:#### 5.1.1 Specialist Training
680:## 2. <think> thinking
741:## Tools You have access to a set of tools to help answer the user’s question. You can invoke tools by writing a "`<|DSML|tool_calls>`" block like the following: `<|DSML|tool_calls>` `<|DSML|invoke name="$TOOL_NAME">` `<|DSML|parameter name="$PARAMETER_NAME" string="true|false">`$PARAMETER_VALUE `</|DSML|parameter>` ... `</|DSML|invoke>` `<|DSML|invoke name="$TOOL_NAME2">` ... `</|DSML|invoke>` `</|DSML|tool_calls>` String parameters should be specified as is and set ‘string="true"‘. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set ‘string="false"‘. If thinking_mode is enabled (triggered by `<think>`), you MUST output your complete reasoning inside `<think>`...`</think>` BEFORE any tool calls or final response. Otherwise, output directly after `</think>` with tool calls or final response. ### Available Tool Schemas {Tool Definition...} You MUST strictly follow the above definedtool name and parameter schemas to invoke tool calls.
751:#### 5.1.2 On-Policy Distillation
818:### 5.2 RL and OPD Infrastructures
822:#### 5.2.1 FP4 Quantization Integration
826:#### 5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
830:#### 5.2.3 Preemptible and Fault-Tolerant Rollout Service
836:#### 5.2.4 Scaling RL Framework for Million-Token Context
840:#### 5.2.5 Sandbox Infrastructure for Agentic AI
846:### 5.3 Standard Benchmark Evaluation
848:#### 5.3.1 Evaluation Setup
856:#### 5.3.2 Evaluation Results
1020:### 5.4 Performance on Real-World Tasks
1024:#### 5.4.1 Chinese Writing
1028:#### 5.4.2 Search
1032:#### 5.4.3 White-Collar Task
1138:#### 5.4.4 Code Agent
1158:## 6. Conclusion, Limitations, and Future Directions
1164:## References
1192:## A. Author List and Acknowledgment
1194:### A.1 Author List
1200:### A.2 Acknowledgment
1204:## B. Evaluation Details

There are still a couple of false positives in the heading detection. Table 2 contains a numbered item in its "Response Format" column that starts with "2. <think> thinking", and my heading detection picks it up as a real heading because it matches the numbered-heading pattern. The detection runs on the paragraphs split from the raw body text, which cannot distinguish a real section heading from table content that merely looks like one.

I'm noticing that my filtering logic isn't catching all the duplicate table content—the raw paragraph text from Table 2 is still leaking through even though it should be filtered out as noise. There's also a malformed heading appearing from the table's raw data that I need to clean up.

The real issue is that I'm only swallowing noise paragraphs immediately after captions, but the raw extraction contains fragmented table content scattered throughout that I've already emitted in layout mode. I need a better boundary detection—instead of trying to identify where the raw table region ends semantically, I should skip forward by counting the paragraphs that the layout region already covered, or use the line positions to align them properly.

The cleanest approach is to have the table extraction also return the first prose line that appears after each table, then use that as a marker in the raw text to know where to stop swallowing fragments. I'll modify the table extraction to capture the opening 60 characters of the next prose line as a boundary prefix, then in the main processor, after emitting a table, skip all raw paragraphs until I encounter that boundary text.

Defining the data structure and implementing the boundary detection...

The tricky part is handling Unicode characters and ensuring the prefix match works even when raw paragraphs are reflowed differently than the layout version, but the boundary line should still start the same way.
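One way to make the prefix comparison tolerant of reflow and Unicode variants, as a sketch. The helper names are hypothetical, and normalizing with NFKC plus whitespace collapsing is my assumption about what "handling Unicode" would need here.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold compatibility characters and collapse all whitespace runs."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def boundary_prefix(layout_line: str, length: int = 60) -> str:
    """Opening characters of the first prose line after a table."""
    return normalize(layout_line)[:length]


def starts_with_boundary(raw_paragraph: str, prefix: str) -> bool:
    """True when a raw paragraph opens with the recorded boundary prefix.
    An empty prefix means 'no boundary known' and never matches."""
    return bool(prefix) and normalize(raw_paragraph).startswith(prefix)


layout_line = "window of DeepSeek-V4 series, we further refine this mechanism to maximize the effectiveness"
raw_paragraph = (
    "window of DeepSeek-V4 series, we further refine this mechanism to maximize the effectiveness\n"
    "of interleaved thinking in agentic environments:"
)
prefix = boundary_prefix(layout_line, length=40)
print(repr(prefix))
print(starts_with_boundary(raw_paragraph, prefix))
```

Normalizing both sides the same way matters more than the exact normal form: the comparison only has to agree with itself.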

Edited build_deepseek_v4.py

>cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -5 && rg -n "^#" deepseek-v4.md | head -80

Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
10:## 1. Introduction
34:## 2. Architecture
40:### 2.1 Designs Inherited from DeepSeek-V3
44:### 2.2 Manifold-Constrained Hyper-Connections
166:### 2.3 Hybrid Attention with CSA and HCA
170:#### 2.3.1 Compressed Sparse Attention
296:#### 2.3.2 Heavily Compressed Attention
342:#### 2.3.3 Other Details
348:#### 2.3.4 Efficiency Discussion
354:### 2.4 Muon Optimizer
362:## 3. General Infrastructures
364:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
420:### 3.2 Flexible and Efficient Kernel Development with TileLang
428:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
434:### 3.4 FP4 Quantization-Aware Training
440:### 3.5 Training Framework
444:#### 3.5.1 Efficient Implementation of Muon
448:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
452:#### 3.5.3 Contextual Parallelism for Long-Context Attention
456:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
462:### 3.6 Inference Framework
466:#### 3.6.1 KV Cache Structure and Management
538:#### 3.6.2 On-Disk KV Cache Storage
544:## 4. Pre-Training
546:### 4.1 Data Construction
552:### 4.2 Pre-Training Setups
554:#### 4.2.1 Model Setups
558:#### 4.2.2 Training Setups
564:#### 4.2.3 Mitigating Training Instability
570:### 4.3 Evaluations
572:#### 4.3.1 Evaluation Benchmarks
582:#### 4.3.2 Evaluation Results
588:## 5. Post-Training
590:### 5.1 Post-Training Pipeline
594:#### 5.1.1 Specialist Training
620:## 2. <think> thinking
689:#### 5.1.2 On-Policy Distillation
716:### 5.2 RL and OPD Infrastructures
720:#### 5.2.1 FP4 Quantization Integration
724:#### 5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
728:#### 5.2.3 Preemptible and Fault-Tolerant Rollout Service
734:#### 5.2.4 Scaling RL Framework for Million-Token Context
738:#### 5.2.5 Sandbox Infrastructure for Agentic AI
744:### 5.3 Standard Benchmark Evaluation
746:#### 5.3.1 Evaluation Setup
754:#### 5.3.2 Evaluation Results
774:### 5.4 Performance on Real-World Tasks
778:#### 5.4.1 Chinese Writing
782:#### 5.4.2 Search
786:#### 5.4.3 White-Collar Task
892:#### 5.4.4 Code Agent
912:## 6. Conclusion, Limitations, and Future Directions
918:## References
946:## A. Author List and Acknowledgment
948:### A.1 Author List
954:### A.2 Acknowledgment
958:## B. Evaluation Details

>cd /Users/kiran/Downloads/pdf2kindle && python3 -c "
import sys
sys.path.insert(0, '.')
from build_deepseek_v4 import extract_tables_layout
from pathlib import Path
tables = extract_tables_layout(Path('/Users/kiran/Downloads/DeepSeek_V4.pdf'))
for n in sorted(tables):
    print(n, repr(tables[n].next_prose_prefix))
"

1 'Pro-Base. All models are evaluated in ou'
2 'Table 3 | Instruction injected into the '
3 'model leverages its own logic to general'
4 'window of DeepSeek-V4 series, we further'
5 '{𝜋𝐸1 , 𝜋𝐸2 , . . . , 𝜋𝐸 𝑁 }, the OPD obj'
6 '"xHigh", and "High" denote reasoning eff'
7 '"High", and "Max" denote reasoning effor'
8 'using DeepSeek-V4-Pro for agentic coding'
9 'Table 10 | Cost Comparison:Agentic Searc'
10 'DeepSeek-V4-Pro. Most of the tool calls '
11 'Category      Subcategory               '
12 'Category     Subcategory                '
13 'Subcategory (文体)               #   DS Ge'
14 ''

I'm noticing an issue with how Table 1's caption is being parsed — the layout extraction is treating wrapped caption lines as prose because they lack indentation, when actually the caption text extends further than expected. Let me check the layout output to see where Table 1 actually ends.

Looking at the other tables, Table 2 correctly ends when Table 3's caption appears, Table 3 and Table 4 have the right prefixes, and Tables 11-13 are picking up header rows from the appendix which I'm handling separately anyway. The real problem is with my parsing order — I'm checking for headings before applying the swallow mode logic, so a heading that appears before a table caption gets emitted even when it should be suppressed.

I need to restructure the loop so that the swallow check runs first. While in table-swallowing mode, I check whether the current paragraph starts with the boundary prefix recorded for the table I just emitted. If it does, I exit swallow mode and process the paragraph normally. If it doesn't, I skip it entirely. Only after the swallow check do I process headings and captions. This way, content trapped inside a swallowed table region won't accidentally get emitted as a heading.
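The reordered loop might be sketched like this. `process`, the regexes, and the output tuples are simplified stand-ins for the real processor; the boundary prefixes in the example are the ones printed earlier in this session for Tables 2 and 3.

```python
import re

CAPTION_RE = re.compile(r"^Table (\d+) \|")
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")  # hypothetical numbered-heading pattern


def process(paragraphs: list[str], boundary_prefixes: dict[int, str]) -> list[tuple[str, str]]:
    out = []
    swallow_until = None  # boundary prefix of the table currently being swallowed
    for paragraph in paragraphs:
        text = " ".join(paragraph.split())
        # 1. The swallow check runs before anything else.
        if swallow_until is not None:
            if text.startswith(swallow_until):
                swallow_until = None  # boundary reached: handle this paragraph normally
            else:
                continue              # still table debris: drop it
        # 2. Captions start a new swallow region.
        match = CAPTION_RE.match(text)
        if match:
            out.append(("table", text))
            swallow_until = boundary_prefixes.get(int(match.group(1))) or None
            continue
        # 3. Only now are headings considered.
        out.append(("heading" if HEADING_RE.match(text) else "para", text))
    return out


paragraphs = [
    "5.1.1 Specialist Training",
    "Table 2 | Comparison of three reasoning modes",
    "Reasoning Mode",
    "2. <think> thinking",
    'Table 3 | Instruction injected into the system prompt for the "Think Max" mode.',
    "Injected Instruction",
    "model leverages its own logic to generalize across complex tasks.",
]
prefixes = {
    2: "Table 3 | Instruction injected into the ",
    3: "model leverages its own logic to general",
}
for kind, text in process(paragraphs, prefixes):
    print(kind, "|", text)
```

Letting the boundary paragraph fall through to the caption check is what allows one table's boundary to be the next table's caption, as happens with Tables 2 and 3.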

Restructuring the parsing loop...

Edited build_deepseek_v4.py

>cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -3 && rg -n "^#" deepseek-v4.md | head -80

Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
10:## 1. Introduction
34:## 2. Architecture
40:### 2.1 Designs Inherited from DeepSeek-V3
44:### 2.2 Manifold-Constrained Hyper-Connections
166:### 2.3 Hybrid Attention with CSA and HCA
170:#### 2.3.1 Compressed Sparse Attention
296:#### 2.3.2 Heavily Compressed Attention
342:#### 2.3.3 Other Details
348:#### 2.3.4 Efficiency Discussion
354:### 2.4 Muon Optimizer
362:## 3. General Infrastructures
364:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
420:### 3.2 Flexible and Efficient Kernel Development with TileLang
428:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
434:### 3.4 FP4 Quantization-Aware Training
440:### 3.5 Training Framework
444:#### 3.5.1 Efficient Implementation of Muon
448:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
452:#### 3.5.3 Contextual Parallelism for Long-Context Attention
456:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
462:### 3.6 Inference Framework
466:#### 3.6.1 KV Cache Structure and Management
538:#### 3.6.2 On-Disk KV Cache Storage
544:## 4. Pre-Training
546:### 4.1 Data Construction
552:### 4.2 Pre-Training Setups
554:#### 4.2.1 Model Setups
558:#### 4.2.2 Training Setups
564:#### 4.2.3 Mitigating Training Instability
570:### 4.3 Evaluations
572:#### 4.3.1 Evaluation Benchmarks

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "^#" deepseek-v4.md

7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
10:## 1. Introduction
34:## 2. Architecture
40:### 2.1 Designs Inherited from DeepSeek-V3
44:### 2.2 Manifold-Constrained Hyper-Connections
166:### 2.3 Hybrid Attention with CSA and HCA
170:#### 2.3.1 Compressed Sparse Attention
296:#### 2.3.2 Heavily Compressed Attention
342:#### 2.3.3 Other Details
348:#### 2.3.4 Efficiency Discussion
354:### 2.4 Muon Optimizer
362:## 3. General Infrastructures
364:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
420:### 3.2 Flexible and Efficient Kernel Development with TileLang
428:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
434:### 3.4 FP4 Quantization-Aware Training
440:### 3.5 Training Framework
444:#### 3.5.1 Efficient Implementation of Muon
448:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
452:#### 3.5.3 Contextual Parallelism for Long-Context Attention
456:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
462:### 3.6 Inference Framework
466:#### 3.6.1 KV Cache Structure and Management
538:#### 3.6.2 On-Disk KV Cache Storage
544:## 4. Pre-Training
546:### 4.1 Data Construction
552:### 4.2 Pre-Training Setups
554:#### 4.2.1 Model Setups
558:#### 4.2.2 Training Setups
564:#### 4.2.3 Mitigating Training Instability
570:### 4.3 Evaluations
572:#### 4.3.1 Evaluation Benchmarks

I'm noticing a problem with the file structure: only 32 headings are present, and everything after 4.3.1 is missing, including sections like 4.3.2, 5, and 5.1. It looks like the swallow_table mode for Table 1 consumed everything after it. The table's prefix text doesn't appear in the raw file, so the swallow continued to the end of the file and took all subsequent headings with it.
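A minimal sketch of how this kind of gap could be caught automatically, assuming the headings follow the numbered Markdown style shown in the listing above. The helper name `missing_sections` and the list of expected section numbers are hypothetical and are not part of the existing build or QA scripts.

```python
import re

# Matches numbered Markdown headings such as "## 4. Pre-Training" or
# "#### 4.3.1 Evaluation Benchmarks" and captures the section number.
HEADING_RE = re.compile(r"^#{1,6}\s+(\d+(?:\.\d+)*)\.?\s+", re.MULTILINE)


def missing_sections(markdown_text, expected):
    """Return the expected section numbers that have no heading in the text."""
    found = set(HEADING_RE.findall(markdown_text))
    return [number for number in expected if number not in found]


# Sample input built from headings in the listing above.
sample = "\n".join([
    "# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence",
    "## 4. Pre-Training",
    "### 4.3 Evaluations",
    "#### 4.3.1 Evaluation Benchmarks",
])

print(missing_sections(sample, ["4.3.1", "4.3.2", "5", "5.1"]))
```

A check like this only needs the section numbers from the PDF's table of contents, so it can flag swallowed headings without comparing body text.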

Let me look at what Table 1's actual caption says...

>cd /Users/kiran/Downloads/pdf2kindle && rg -n "Table 1 \|" /tmp/dsv4_raw.txt | head -3

1958:Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4Pro-Base. All models are evaluated in our internal framework and share the same evaluation

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '1955,1980p' /tmp/dsv4_raw.txt

27

Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4Pro-Base. All models are evaluated in our internal framework and share the same evaluation
setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. The highest
score in each row is in bold font, and the second is underlined.
Benchmark (Metric)

# Shots

Architecture
# Activated Params
# Total Params

-

DeepSeek-V3.2 DeepSeek-V4-Flash DeepSeek-V4-Pro
Base
Base
Base
MoE
37B
671B

MoE
13B

I'm looking at the table caption which describes a comparison across three DeepSeek model variants, noting that scores within 0.3 points are treated as equivalent, with formatting conventions for the highest and second-highest values in each row.

The layout extraction is breaking the caption across multiple lines, and my prose detection is incorrectly terminating the table early because the wrapped caption lines meet the criteria for prose (unindented, multiple words). I need to distinguish between caption text that continues across lines and actual table data. The key insight is that captions end at a blank line, so I should skip past the entire caption block before checking whether we've reached the end of the table.

Refining the approach...

>cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout /Users/kiran/Downloads/DeepSeek_V4.pdf /tmp/dsv4_layout.txt && rg -n "Table 1 \|" /tmp/dsv4_layout.txt | head -3

1542:Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-

>cd /Users/kiran/Downloads/pdf2kindle && sed -n '1540,1555p' /tmp/dsv4_layout.txt

27
Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-
Pro-Base. All models are evaluated in our internal framework and share the same evaluation
setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. The highest
score in each row is in bold font, and the second is underlined.

                                                 DeepSeek-V3.2 DeepSeek-V4-Flash DeepSeek-V4-Pro
               Benchmark (Metric)      # Shots
                                                     Base            Base             Base
               Architecture               -             MoE          MoE              MoE
               # Activated Params         -              37B          13B             49B
               # Total Params             -             671B         284B             1.6T
              AGIEval (EM)             0-shot           80.1         82.6              83.1
              MMLU (EM)                5-shot           87.8         88.7              90.1
              MMLU-Redux (EM)          5-shot           87.5         89.4              90.8

I'm refining the parsing logic for table detection. The key insight is that a table block starts with a multi-line caption at column 0, followed by a blank line, then indented table data. The block ends when we hit the first column-0 prose line after a blank line. So I need to walk through the text, track whether we've seen at least one blank line, and only then check for the end condition.
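The termination rule described above can be sketched as follows. This is an illustration under stated assumptions, not the code in build_deepseek_v4.py: the function names `is_col0_prose` and `find_table_end`, the three-word prose threshold, and the trailing prose line in the sample are all hypothetical.

```python
def is_col0_prose(line, min_words=3):
    """Heuristic: an unindented line with several words reads as body text.

    min_words is an assumed threshold. A bare page number such as "27"
    is unindented but has only one word, so it does not count as prose.
    """
    return bool(line) and not line[0].isspace() and len(line.split()) >= min_words


def find_table_end(lines, caption_index):
    """Return the index of the first line after the table block.

    The block is: caption lines at column 0, a blank line, indented rows.
    Column-0 prose only ends the block once a blank line has been seen,
    so wrapped caption lines are never mistaken for the end of the table.
    """
    seen_blank = False
    for i in range(caption_index + 1, len(lines)):
        line = lines[i]
        if not line.strip():
            seen_blank = True
            continue
        if seen_blank and is_col0_prose(line):
            return i
    return len(lines)


# Sample shaped like the -layout output above; the last line is invented.
layout_lines = [
    "27",
    "Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-",
    "Pro-Base. All models are evaluated in our internal framework and share the same evaluation",
    "score in each row is in bold font, and the second is underlined.",
    "",
    "               Benchmark (Metric)      # Shots",
    "              AGIEval (EM)             0-shot           80.1         82.6              83.1",
    "",
    "Body text resumes here at column zero.",
]

print(find_table_end(layout_lines, 1))
```

If no column-0 prose follows the table, the function returns `len(lines)`, which reproduces the swallow-to-end behavior, so a caller may want to cap the block at the next page break or heading.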

Implementing the table termination logic...

Edited build_deepseek_v4.py