Understand: Classified source PDF and inspected intermediate Markdown.
Build: Created and improved a custom build script with state-tracking and table handling.
Build: Built the EPUB file.
Verify: Performed deterministic QA checks and readability spot-checks.
Verify: Verified EPUB TOC structure and heading correctness.

Convert this research paper to read on Kindle Paperwhite 7th gen.
/Users/kiran/Downloads/DeepSeek_V4.pdf
I need to understand how this PDF-to-Kindle conversion tool works, so let me start by looking at the project structure and what the README says about the conversion process.
ls /Users/kiran/Downloads/pdf2kindle/
__pycache__ _tmp_epub_inspect _tmp_fp_fetch a-close-call-terence-tao.epub a-close-call-terence-tao.md AGENTS.md ASER_2024_Final-Report_13_2_24-1.pdf aws-builders-library.epub build_hybrid_markdown.py compile_fp_collection.py CONVERSION_QA_CHECKLIST.md event-based-concurrency-1.epub extract.py first-principles-socratic-systems-collection.epub first-principles-socratic-systems-collection.md kindle.css License-and-Permits-Brief-v2.epub License-and-Permits-Brief.epub on-keeping-a-notebook.epub pdf_sources pdf2kindle.sh pyproject.toml qa_epub.py qa_shared.py README.md semaphores-kindle.epub turtles-termites-traffic-jams.epub uv.lock
ls -lh /Users/kiran/Downloads/DeepSeek_V4.pdf && pdfinfo /Users/kiran/Downloads/DeepSeek_V4.pdf 2>/dev/null | head -30
-rw-r--r--@ 1 kiran staff 4.3M Apr 29 17:21 /Users/kiran/Downloads/DeepSeek_V4.pdf
Title:
Subject:
Keywords:
Author:
Creator: LaTeX with hyperref
Producer: pdfTeX-1.40.27
CreationDate: Fri Apr 24 15:27:37 2026 IST
ModDate: Fri Apr 24 15:27:37 2026 IST
Custom Metadata: yes
Metadata Stream: no
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 58
Encrypted: no
Page size: 595.276 x 841.89 pts (A4)
Page rot: 0
File size: 4479907 bytes
Optimized: yes
PDF version: 1.5
I'm looking at a LaTeX research paper that's 58 pages long in A4 format, and I want to check the first page to see what the title and author information are.
cd /Users/kiran/Downloads/pdf2kindle && pdftotext -l 2 /Users/kiran/Downloads/DeepSeek_V4.pdf - | head -80
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-AI
[REDACTED]
Abstract
We present a preview version of DeepSeek-V4 series, including two strong Mixture-ofExperts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) ManifoldConstrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-ProMax, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeekV4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.
DeepSeek-V4-Pro-Max Claude-Opus-4.6-Max 90.2 89.1 85.9 Accuracy / Pass@1 (%) 80 78.1 75.6 GPT-5.4-xHigh Gemini-3.1-Pro-High 32063168 3052 80.680.880.6 75.1 67.9 65.4 68.5 60 57.9 54.6 51.8 46.2 45.3 37.740.039.8 40 47.2 44.4 48.8 0.8 0.6 0.2 0.0 0 (Pass@1)
I'm looking at a research paper from DeepSeek-AI about their V4 model focused on efficient million-token context processing. I'll extract the content from this LaTeX-generated PDF using markdown conversion with the appropriate flags to get it all in one pass, then review and rebuild if needed.
Source classified as: text-based PDF (LaTeX-generated, 58 pages, single column). Running first pass with --keep-md and --no-pause so I can inspect the intermediate Markdown and rebuild after structural cleanup.
cd /Users/kiran/Downloads/pdf2kindle && ./pdf2kindle.sh --no-pause --keep-md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" /Users/kiran/Downloads/DeepSeek_V4.pdf deepseek-v4.epub 2>&1 | tail -80
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
34. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #swa-kv in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
35. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #swa-kv-1 in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
36. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #swa-kv-2 in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
37. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #swa-kv-3 in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
38. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #csa-kv in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
39. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #csa-kv-1 in EPUB/text/ch001.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
40. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #shots in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
41. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #bbh-em in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
42. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #tools in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
43. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #mrcr-1m-mmr in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
44. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #mrcr-1mmmr in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
45. [Navigation & TOC Validation] Heading anchors/targets exist where needed
- Evidence: Missing fragment target #none in EPUB/text/ch002.xhtml (linked from EPUB/nav.xhtml)
- Impact: navigation
- Suggested fix: Ensure internal fragment links point to real ids in the target document.
46. [Technical EPUB Validity] EPUB opens cleanly without archive/format errors
- Evidence: Unable to parse XHTML content EPUB/text/ch001.xhtml: not well-formed (invalid token): line 316, column 3
- Impact: technical
- Suggested fix: Ensure content documents are well-formed XHTML.
47. [Technical EPUB Validity] EPUB opens cleanly without archive/format errors
- Evidence: Unable to parse XHTML content EPUB/text/ch002.xhtml: mismatched tag: line 54, column 46
- Impact: technical
- Suggested fix: Ensure content documents are well-formed XHTML.
48. [Artifact Cleanup Sweep] Wrapped URLs and broken inline text repaired
- Evidence: Split URL detected in source markdown: DeepSeek_V4.md
- Impact: readability
- Suggested fix: Normalize wrapped URLs before final build.
49. [Artifact Cleanup Sweep] OCR artifacts spot-cleaned (if OCR used)
- Evidence: Control characters detected in source markdown: DeepSeek_V4.md
- Impact: readability
- Suggested fix: Strip control characters before EPUB generation.
50. [Artifact Cleanup Sweep] Accidental line joins/splits fixed
- Evidence: Split hyphenated words detected in source markdown: deci- summary, reason- prompt
- Impact: readability
- Suggested fix: De-hyphenate wrapped words before final build.
51. [Artifact Cleanup Sweep] Repeated page headers/footers/page numbers removed
- Evidence: Suspicious repeated lines in source markdown: 'Total (总计)' x4; 'Subtotal (小计)' x8
- Impact: readability
- Suggested fix: Inspect for leaked headers/footers or other repeated boilerplate and remove them before final build.
52. [Artifact Cleanup Sweep] Paragraph flow is smooth for e-ink reading
- Evidence: Very long paragraph detected in DeepSeek_V4.md: By employing hybrid CSA and HCA, along with precision optimizations on computation and storage, DeepSeek-V4 series achieve significantly low…
- Impact: readability
- Suggested fix: Split dense paragraphs into smaller Kindle-friendly chunks before final build.
WARNING: Validation found issues (see above). Review before sending to Kindle.
Markdown file kept: DeepSeek_V4.md
I'm seeing a bunch of issues pop up, so let me review the complete list and check through the markdown to understand what's going on.
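Two of the fixes the QA report suggests, stripping control characters and de-hyphenating wrapped words, can be sketched as a small pre-processing pass over the intermediate Markdown. This is a minimal sketch, not the project's actual cleanup code: the function names are hypothetical, and the de-hyphenation rule assumes the split falls at a line break between two lowercase fragments.

```python
import re

# Control characters other than tab, line feed, and carriage return.
# pdftotext emits form feeds (\x0c) at page breaks, which land in this set.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# A lowercase fragment ending in "-" at a line end, continued in lowercase
# on the next line (assumed shape of a wrapped word).
_SPLIT_WORD = re.compile(r"([a-z]{2,})-\n([a-z]{2,})")


def strip_control_chars(text: str) -> str:
    """Remove control characters, keeping tab, LF, and CR."""
    return _CONTROL_CHARS.sub("", text)


def dehyphenate(text: str) -> str:
    """Join words split by a line-end hyphen: 'deci-' + 'sion' -> 'decision'."""
    return _SPLIT_WORD.sub(r"\1\2", text)


def clean_markdown(text: str) -> str:
    """Hypothetical cleanup pass: control characters first, then wrapped words."""
    return dehyphenate(strip_control_chars(text))
```

The de-hyphenation rule is deliberately naive: it would also join a genuine compound such as "long-" / "context" if the line happens to break at the hyphen, so a dictionary check or manual review is still needed for those cases.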
cd /Users/kiran/Downloads/pdf2kindle && ./pdf2kindle.sh --no-pause --keep-md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" /Users/kiran/Downloads/DeepSeek_V4.pdf deepseek-v4.epub 2>&1 | head -120
==> Step 1: Extracting text from PDF...
Extracting text from: /Users/kiran/Downloads/DeepSeek_V4.pdf
Applying cleanup heuristics...
Markdown written to: DeepSeek_V4.md
2352 lines, 59 pages processed
15 headings detected, 0 header/footer patterns stripped, 508 page numbers removed, 0 TOC lines removed
==> Step 3: Converting Markdown to EPUB...
[WARNING] Could not convert TeX math TOOL_NAME"> <|DSML|parameter name=", rendering as TeX:
TOOL_NAME"> <|DSML|parameter name="
^
unexpected '"'
expecting "\\bangle", "\\brace", "\\brack", "\\choose", "\\displaystyle", "\\textstyle", "\\scriptstyle", "\\scriptscriptstyle", "{", "\\operatorname", letter, digit, ".", "!", "'", "''", "'''", "''''", "*", "+", ",", "-", ".", "/", ":", ":=", ";", "<", "=", ">", "?", "@", "~", "_", "^", "'", "\\left", "\\lVert", "\\", "\\hyperref" or end of input
[WARNING] Could not convert TeX math PARAMETER_VALUE </|DSML|parameter> ...
</|DSML|invoke> <|DSML|invoke name=", rendering as TeX:
<|DSML|invoke name="
^
unexpected '"'
expecting "\\bangle", "\\brace", "\\brack", "\\choose", "\\displaystyle", "\\textstyle", "\\scriptstyle", "\\scriptscriptstyle", "{", "\\operatorname", letter, digit, ".", "!", "'", "''", "'''", "''''", "*", "+", ",", "-", ".", "/", ":", ":=", ";", "<", "=", ">", "?", "@", "~", "_", "^", "'", "\\left", "\\lVert", "\\", "\\hyperref" or end of input
Done! EPUB written to: deepseek-v4.epub
Transfer to Kindle via USB or Send to Kindle.
==> Step 4: Validating EPUB...
warning: The `tool.uv.dev-dependencies` field (used in `pyproject.toml`) is deprecated and will be removed in a future release; use `dependency-groups.dev` instead
Conversion QA - Failed Items
1. [Technical EPUB Validity] EPUBCheck: RSC-005
- Evidence: [ERROR] Error while parsing file: Anchors within nav elements must contain text — deepseek-v4.epub/EPUB/nav.xhtml:13:1673
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
2. [Technical EPUB Validity] EPUBCheck: RSC-005
- Evidence: [ERROR] Error while parsing file: Anchors within nav elements must contain text — deepseek-v4.epub/EPUB/nav.xhtml:13:1914
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
3. [Technical EPUB Validity] EPUBCheck: RSC-005
- Evidence: [ERROR] Error while parsing file: element "think" not allowed here; expected the element end-tag, text, element "a", "abbr", "address", "area", "article", "aside", "audio", "b", "bdi", "bdo", "blockquote", "br", "button", "canvas", "cite", "code", "data", "datalist", "del", "details", "dfn", "dialog", "div", "dl", "em", "embed", "epub:switch", "epub:trigger", "fieldset", "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "i", "iframe", "img", "input", "ins", "kbd", "label", "link", "main", "map", "mark", "menu", "meta", "meter", "nav", "ns1:math", "ns2:svg", "object", "ol", "output", "p", "picture", "pre", "progress", "q", "ruby", "s", "samp", "script", "section", "select", "slot", "small", "span", "strong", "sub", "sup", "table", "template", "textarea", "time", "u", "ul", "var", "video" or "wbr" (with xmlns:ns1="http://www.w3.org/1998/Math/MathML" xmlns:ns2="http://www.w3.org/2000/svg") or an element from another namespace — deepseek-v4.epub/EPUB/text/ch002.xhtml:47:1584
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
4. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:727
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
5. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:801
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
6. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:870
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
7. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:939
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
8. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1009
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
9. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1077
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
10. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1147
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
11. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1371
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
12. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1443
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
13. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1522
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
14. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/nav.xhtml:13:1594
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
15. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:52:54
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
16. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:58:50
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
17. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:64:52
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
18. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:70:52
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
19. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:76:52
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
20. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:82:50
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
21. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:88:52
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
22. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:106:49
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
23. [Technical EPUB Validity] EPUBCheck: RSC-012
- Evidence: [ERROR] Fragment identifier is not defined. — deepseek-v4.epub/EPUB/toc.ncx:112:55
- Impact: technical
- Suggested fix: Refer to the EPUB specification for details.
24. [Technical EPUB Validity] EPUBCheck: RSC-012
I'm seeing several parsing issues that trace back to the Markdown file. The <think> tags from the paper's reasoning examples are being passed through as raw HTML instead of escaped text (EPUBCheck reports element "think" not allowed here), and the same problem affects the <|DSML|...> token examples that show structured outputs, which also appear in the TeX math warnings. There are also broken TOC links: the RSC-012 errors show nav.xhtml and toc.ncx pointing at fragment ids that do not exist in the chapter files.
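One way to keep tag-like tokens from being parsed as markup is to escape them in the Markdown before conversion. The sketch below is an assumption about how that could be done, not the tool's implementation: `escape_tag_like` is a hypothetical helper, the sample strings are made up for illustration, and the `$` escaping assumes the TeX math warnings come from dollar signs inside the token examples.

```python
import re

# Tag-like tokens seen in the QA output: <think>, </think>, <|DSML|...>, </|DSML|...>
_TAG_LIKE = re.compile(r"</?(?:think\b|\|DSML\|)[^<>\n]*>")


def _escape(match: re.Match) -> str:
    token = match.group(0)
    # Backslash-escape "$" so it is not read as a math delimiter,
    # then turn the angle brackets into HTML entities.
    token = token.replace("$", "\\$")
    return token.replace("<", "&lt;").replace(">", "&gt;")


def escape_tag_like(text: str) -> str:
    """Escape only the known tag-like tokens; other markup is left untouched."""
    return _TAG_LIKE.sub(_escape, text)


if __name__ == "__main__":
    print(escape_tag_like("The model emits <think> ... </think> before answering."))
```

Entities are only decoded in ordinary Markdown text. If the examples are later wrapped in code spans or fenced blocks, the escaping should be skipped there, because the entities would be shown literally.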
cd /Users/kiran/Downloads/pdf2kindle && wc -l DeepSeek_V4.md && rg -n "^#" DeepSeek_V4.md | head -80
2351 DeepSeek_V4.md
128:## Architecture
417:## Csa / Hca
585:## 𝑍 = 𝐻 ·𝑊 ,
593:## 𝑍 = 𝐻 ·𝑊 ,
725:## 𝑍 = 𝐻 ·𝑊 ,
944:## Swa Kv
950:## Swa Kv
960:## Swa Kv
980:## Swa Kv
992:## Csa Kv
1011:## Csa Kv
1066:# Shots
1098:## Bbh (Em)
1194:## Tools
1201:### Available Tool Schemas
1348:## Mrcr 1M (Mmr)
1430:## Mrcr 1M(Mmr)
1606:## None
1977:#
2035:# V4 win V3.2 win
2109:# DS win Gem win Tie
2265:#
2331:# DS Opus Tie
The heading detection approach is creating too much noise, so I'm going to set that aside and work directly from the paper's table of contents to manually reconstruct the document structure based on what's actually there.
cd /Users/kiran/Downloads/pdf2kindle && pdftotext -f 3 -l 4 /Users/kiran/Downloads/DeepSeek_V4.pdf -
5 Post-Training 29
5.1 Post-Training Pipeline . . . 29
5.1.1 Specialist Training . . . 29
5.1.2 On-Policy Distillation . . . 32
RL and OPD Infrastructures . . . 34
5.2.1 FP4 Quantization Integration . . . 34
5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD . . . 34
5.2.3 Preemptible and Fault-Tolerant Rollout Service . . . 34
5.2.4 Scaling RL Framework for Million-Token Context . . . 35
5.2.5 Sandbox Infrastructure for Agentic AI . . . 35
Standard Benchmark Evaluation . . . 36
5.3.1 Evaluation Setup . . . 36
5.3.2 Evaluation Results . . . 38
Performance on Real-World Tasks . . . 41
5.4.1 Chinese Writing . . . 41
5.4.2 Search . . . 42
5.4.3 White-Collar Task . . . 42
5.4.4 Code Agent . . . 44
5.2 5.3 5.4
6 Conclusion, Limitations, and Future Directions
A Author List and Acknowledgment
44 54
A.1 Author List . . . 54
A.2 Acknowledgment . . . 55
B Evaluation Details 55
3
1.
Introduction The emergence of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024c) has established a new paradigm of test-time scaling, driving substantial performance gains for Large Language Models (LLMs). However, this scaling paradigm is fundamentally constrained by the quadratic computational complexity of the vanilla attention mechanism (Vaswani et al., 2017), which creates a prohibitive bottleneck for ultra-long contexts and reasoning processes. Concurrently, the emergence of long-horizon scenarios and tasks — from complex agentic workflows to massive cross-document analysis — has also made efficient support for ultra-long contexts critical for future progress. While recent open-source efforts (Bai et al., 2025a; DeepSeek-AI, 2024; MiniMax, 2025; Qwen, 2025) have advanced general capabilities, this core architectural inefficiency in handling ultra-long sequences remains a key impediment, limiting further gains from test-time scaling and hindering further exploration into long-horizon scenarios and tasks. In order to break the efficiency barrier in ultra-long contexts, we develop the DeepSeek-V4 series, including the preview versions of DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Through architectural innovations, DeepSeek-V4 series achieve a dramatic leap in computational efficiency for processing ultra-long sequences. This breakthrough enables efficient support for a context length of one million tokens, ushering in a new era of million-length contexts for next-generation LLMs. We believe our capability to efficiently handle ultra-long sequences unlocks the next frontier of test-time scaling, paves the way for deeper research into long-horizon tasks, and establishes a necessary foundation for exploring future paradigms like online learning. 
Compared with the DeepSeek-V3 architecture (DeepSeek-AI, 2024), DeepSeek-V4 series retain the DeepSeekMoE framework (Dai et al., 2024) and Multi-Token Prediction (MTP) strategy, while introducing several key innovations in architecture and optimization. To enhance longcontext efficiency, we design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses the KV caches along the sequence dimension and then performs DeepSeek Sparse Attention (DSA) (DeepSeekAI, 2025), whereas HCA applies more aggressive compression to the KV caches but keeps dense attention. To strengthen modeling capability, we incorporate Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2026) that upgrade conventional residual connections. Additionally, we introduce the Muon (Jordan et al., 2024; Liu et al., 2025) optimizer to the training of DeepSeek-V4 series, leading to faster convergence and improved training stability. To enable efficient training and inference for DeepSeek-V4 series as well as productive development, we introduce several infrastructure optimizations. First, we design and implement a single fused kernel for MoE modules that fully overlaps computation, communication, and memory access. Second, we employ TileLang (Wang et al., 2026), a Domain-Specific Language (DSL) to balance development productivity and runtime efficiency. Third, we provide efficient batch-invariant and deterministic kernel libraries to ensure bitwise reproducibility across training and inference. Fourth, we incorporate FP4 quantization-aware training for MoE expert weights and the indexer QK path to reduce memory and computation. 
Fifth, for the training framework, we extend the autograd framework with tensor-level checkpointing for fine-grained recomputation control; and we enhance training efficiency with a hybrid ZeRO strategy for the Muon optimizer, cost-effective mHC implementations via recomputation and fused kernels, and two-stage contextual parallelism to manage compressed attention. Finally, for the inference framework, we design a heterogeneous KV cache structure with on-disk storage strategies to enable efficient shared-prefix reuse. 4
cd /Users/kiran/Downloads/pdf2kindle && pdftotext -f 2 -l 3 /Users/kiran/Downloads/DeepSeek_V4.pdf -
Contents
1 Introduction 4
2 Architecture 6
2.1 Designs Inherited from DeepSeek-V3 . . . 7
2.2 Manifold-Constrained Hyper-Connections . . . 7
2.3 Hybrid Attention with CSA and HCA . . . 9
2.3.1 Compressed Sparse Attention . . . 9
2.3.2 Heavily Compressed Attention . . . 11
2.3.3 Other Details . . . 12
2.3.4 Efficiency Discussion . . . 13
Muon Optimizer . . . 14
2.4
3 General Infrastructures 15
3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism . . . 15
3.2 Flexible and Efficient Kernel Development with TileLang . . . 16
3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries . . . 18
3.4 FP4 Quantization-Aware Training . . . 19
3.5 Training Framework . . . 20
3.5.1 Efficient Implementation of Muon . . . 20
3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC . . . 21
3.5.3 Contextual Parallelism for Long-Context Attention . . . 21
3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing 21
3.6
4
Inference Framework . . . 22
3.6.1 KV Cache Structure and Management . . . 22
3.6.2 On-Disk KV Cache Storage . . . 23
Pre-Training 24
4.1 Data Construction . . . 24
4.2 Pre-Training Setups . . . 25
4.2.1 Model Setups . . . 25
4.2.2 Training Setups . . . 25
4.2.3 Mitigating Training Instability . . . 26
Evaluations . . . 27
4.3.1 Evaluation Benchmarks . . . 27
4.3.2 Evaluation Results . . . 28
4.3
2
5 Post-Training 29
5.1 Post-Training Pipeline . . . 29
5.1.1 Specialist Training . . . 29
5.1.2 On-Policy Distillation . . . 32
RL and OPD Infrastructures . . . 34
5.2.1 FP4 Quantization Integration . . . 34
5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD . . . 34
5.2.3 Preemptible and Fault-Tolerant Rollout Service . . . 34
5.2.4 Scaling RL Framework for Million-Token Context . . . 35
5.2.5 Sandbox Infrastructure for Agentic AI . . . 35
Standard Benchmark Evaluation . . . 36
5.3.1 Evaluation Setup . . . 36
5.3.2 Evaluation Results . . . 38
Performance on Real-World Tasks . . . 41
5.4.1 Chinese Writing . . . 41
5.4.2 Search . . . 42
5.4.3 White-Collar Task . . . 42
5.4.4 Code Agent . . . 44
5.2 5.3 5.4
6 Conclusion, Limitations, and Future Directions
A Author List and Acknowledgment
44 54
A.1 Author List . . . 54
A.2 Acknowledgment . . . 55
B Evaluation Details 55
3
Now I'm examining the source markdown to identify structural issues, starting with the first section of the document.
The table of contents and figure residue are cluttering the output, so the cleanup has four parts: remove the TOC pages from the body, since the EPUB's generated navigation already covers them; strip figure labels and axis annotations; rebuild the headings from the paper's actual structure; and escape any HTML-like syntax that could break XHTML parsing.
For a 58-page LaTeX-generated PDF, I'm weighing two approaches: reconstructing the document semantics by hand from the paper's TOC, or applying targeted transformations to the extracted text. The extraction is mostly solid, with problems confined to specific areas, so the better move is to re-extract the raw text and write a custom reconstruction script that uses the paper's TOC as a scaffold to rebuild the Markdown cleanly, stripping out the TOC pages and figure noise along the way.
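The TOC-as-scaffold idea can be sketched as a parser that turns TOC lines into numbered entries with a heading level. This is an illustrative sketch under stated assumptions: `TocEntry`, `parse_toc_line`, and `to_markdown_heading` are hypothetical names, and each TOC entry is assumed to sit on one line in the form "number, title, dot leaders, page". Raw pdftotext output does not always keep the number and title together, and this sketch does not handle that case.

```python
import re
from typing import NamedTuple, Optional


class TocEntry(NamedTuple):
    number: str  # section number such as "2.3.1" or "A.1"
    title: str
    level: int   # 1 for "2", 2 for "2.3", 3 for "2.3.1"


# Section number, title, optional dot leaders, page number.
_TOC_LINE = re.compile(
    r"^\s*((?:\d+|[A-Z])(?:\.\d+)*)\s+(.+?)(?:\s*\.(?:\s*\.)+)?\s*(\d+)\s*$"
)


def parse_toc_line(line: str) -> Optional[TocEntry]:
    """Parse one TOC line; return None for lines that are not TOC entries."""
    m = _TOC_LINE.match(line)
    if m is None:
        return None
    number, title, _page = m.groups()
    title = title.strip()
    # Reject stray number-only lines such as "44 54".
    if not any(ch.isalpha() for ch in title):
        return None
    return TocEntry(number, title, number.count(".") + 1)


def to_markdown_heading(entry: TocEntry) -> str:
    """Render an entry as a Markdown heading at the matching depth."""
    return f"{'#' * entry.level} {entry.number} {entry.title}"
```

The page number is parsed but dropped, since a reflowable EPUB has no fixed pages. The heading depth comes from the number of dots in the section number, which keeps the EPUB TOC aligned with the paper's own hierarchy.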
cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout /Users/kiran/Downloads/DeepSeek_V4.pdf /tmp/dsv4_layout.txt && wc -l /tmp/dsv4_layout.txt
3130 /tmp/dsv4_layout.txt
cd /Users/kiran/Downloads/pdf2kindle && pdftotext /Users/kiran/Downloads/DeepSeek_V4.pdf /tmp/dsv4_raw.txt && wc -l /tmp/dsv4_raw.txt && rg -n "^[0-9]\.[0-9].*\." /tmp/dsv4_raw.txt | head -30
4905 /tmp/dsv4_raw.txt
202:2.3.1
208:2.3.2
214:2.3.3
220:2.3.4
267:3.5.1
273:3.5.2
279:3.5.3
285:3.5.4
297:3.6.1
303:3.6.2
325:4.2.1
331:4.2.2
337:4.2.3
347:4.3.1
353:4.3.2
375:5.1.1
381:5.1.2
391:5.2.1
397:5.2.2
403:5.2.3
409:5.2.4
415:5.2.5
425:5.3.1
431:5.3.2
441:5.4.1
447:5.4.2
453:5.4.3
459:5.4.4
635:2.1. Designs Inherited from DeepSeek-V3
650:2.2. Manifold-Constrained Hyper-Connections
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^[0-9]+\.[0-9]+\.\s+[A-Z]" /tmp/dsv4_raw.txt | head -60635:2.1. Designs Inherited from DeepSeek-V3 650:2.2. Manifold-Constrained Hyper-Connections 837:2.3. Hybrid Attention with CSA and HCA 1239:2.4. Muon Optimizer 1268:3.1. Fine-Grained Communication-Computation Overlap in Expert Parallelism 1393:3.2. Flexible and Efficient Kernel Development with TileLang 1451:3.3. High-Performance Batch-Invariant and Deterministic Kernel Libraries 1508:3.4. FP4 Quantization-Aware Training 1537:3.5. Training Framework 1630:3.6. Inference Framework 1812:4.1. Data Construction 1832:4.2. Pre-Training Setups 1940:4.3. Evaluations 2150:5.1. Post-Training Pipeline 2506:5.3. Standard Benchmark Evaluation 3178:5.4. Performance on Real-World Tasks
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^[0-9]+\.\s+[A-Z]" /tmp/dsv4_raw.txt | head -30622:2. Architecture 1811:4. Pre-Training 1859:128. We employ MoE layers in all Transformer blocks, but use the Hash routing strategy for the 2149:5. Post-Training 3443:6. Conclusion, Limitations, and Future Directions 3545:07372. URL https://doi.org/10.48550/arXiv.2601.07372. 3587:2378. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1246. URL 3742:2021. URL https://proceedings.neurips.cc/paper/2021/hash/92bf5e6240737 3755:2019. URL http://arxiv.org/abs/1911.02150. 3771:2024. URL https://arxiv.org/abs/2407.11214.
I notice that "General Infrastructures" seems to be missing from the list or isn't properly separated on its own line, so let me check the formatting again.
cd /Users/kiran/Downloads/pdf2kindle && rg -n "General Infrastructures|^1\. " /tmp/dsv4_raw.txt | head -10
233:General Infrastructures
1267:3. General Infrastructures
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^1\. Introduction|^A\.|^A Author|^B Evaluation|Acknowledgment$|^A.[12]" /tmp/dsv4_raw.txt
475:A Author List and Acknowledgment
480:A.1 Author List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
484:A.2 Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
488:B Evaluation Details
3481:A. Agache, M. Brooker, A. Florescu, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and
3529:A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
3530:A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage,
3539:A. Cheng, A. Jacovi, A. Globerson, B. Golan, C. Kwong, C. Alberti, C. Tao, E. Ben-David, G. S.
3716:A. Hutchison, A. Abdagic, A. Carl, A. Shen, A. Brock, A. Coenen, A. Laforge, A. Paterson,
3772:A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30,
3841:A. Author List and Acknowledgment
3842:A.1. Author List
3892:A.2. Acknowledgment
cd /Users/kiran/Downloads/pdf2kindle && rg -n "Introduction" /tmp/dsv4_raw.txt | head -10
176:Introduction
494:1. Introduction
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^B\.|^B Evaluation Details|^B\. Evaluation" /tmp/dsv4_raw.txt | head -10
488:B Evaluation Details
3527:B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet,
3618:B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko.
3714:B. Shahriari, A. Ram’e, J. Ferret, P. Liu, P. D. Tafti, A. Friesen, M. Casbon, S. Ramos, R. Kumar,
3717:B. Bastian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar, C. Perry, C. A. Welty, C. A. ChoquetteChoo, D. Sinopalnikov, D. Weinberger, D. Vijaykumar, D. Rogozi’nska, D. Herbison, E. Bandy,
3744:B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea,
3796:B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou,
3896:B. Evaluation Details
cd /Users/kiran/Downloads/pdf2kindle && sed -n '3895,3905p' /tmp/dsv4_raw.txt
B. Evaluation Details
Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro.
Difficulty Category # Agent Win RAG Win Tie Agent% RAG% Tie% Easy
I'm settling on a concrete approach now. I'll create a Python script that processes the raw PDF text output to build clean markdown—using pdftotext without layout preservation to maintain proper reading order, then skipping the table of contents pages and starting from the introduction section, with regex patterns to detect and format major section headings appropriately.
The script will also handle cleanup tasks like removing page numbers, reflowing wrapped paragraphs, fixing hyphenated words split across lines, escaping XML-like tags that might confuse pandoc, stripping repeated headers and footers, and carefully removing orphaned figure caption fragments while preserving actual captions. I need to verify exactly how the section numbers and titles are laid out in the raw text before finalizing the regex patterns.
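A minimal sketch of the reflow step in that plan, under stated assumptions: the function name `reflow` and the `PAGE_NUMBER` pattern are hypothetical, a bare one-to-three digit line is treated as a page number (which would also drop numeric table cells), and a trailing hyphen is assumed to be a line-break hyphen rather than part of a compound word. The real script would need tighter guards on both.

```python
import re

# Assumption: a line holding only 1-3 digits is a page number.
PAGE_NUMBER = re.compile(r"^\d{1,3}$")


def reflow(lines):
    """Join wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs, current = [], []
    for raw in lines:
        line = raw.strip()  # strip() also removes form-feed page-break characters
        if PAGE_NUMBER.match(line):
            continue  # drop bare page numbers
        if not line:
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if current and current[-1].endswith("-"):
            # Naive de-hyphenation: glue the continuation onto the previous line.
            current[-1] = current[-1][:-1] + line
        else:
            current.append(line)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = ["The model is effi-", "cient.", "30", "", "Next para", "continues."]
print(reflow(sample))
```

Paragraph boundaries here come only from blank lines, so a page number that sits between two halves of one paragraph with blank lines around it would still split that paragraph; handling that case is left out of the sketch.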
cd /Users/kiran/Downloads/pdf2kindle && sed -n '618,665p' /tmp/dsv4_raw.txtachieves comparable results on reasoning tasks when allocated a larger thinking budget. In agent evaluations, while DeepSeek-V4-Flash-Max matches the performance of DeepSeek-V4-Pro-Max on several benchmarks, it still trails its larger counterpart on more complex, high-difficulty tasks. 2. Architecture Overall, DeepSeek-V4 series retain the Transformer (Vaswani et al., 2017) architecture and MultiToken Prediction (MTP) modules (DeepSeek-AI, 2024; Gloeckle et al., 2024), while introducing several key upgrades over DeepSeek-V3: (1) firstly, we introduce the Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2026) to strengthen conventional residual connections; 6 (2) secondly, we design a hybrid attention architecture, which greatly improves long-context efficiency through Compressed Sparse Attention and Heavily Compressed Attention. (3) thirdly, we employ Muon (Jordan et al., 2024; Liu et al., 2025) as the optimizer. For the Mixture-ofExperts (MoE) components, we still adopt the DeepSeekMoE (Dai et al., 2024) architecture, with only minor adjustments from DeepSeek-V3. The Multi-Token Prediction (MTP) (DeepSeek-AI, 2024; Gloeckle et al., 2024; Li et al., 2024; Qi et al., 2020) configuration remains identical to that of DeepSeek-V3. All other unspecified details follow the settings established in DeepSeekV3 (DeepSeek-AI, 2024). Figure 2 illustrates the overall architecture of DeepSeek-V4, and the details are described below. 2.1. Designs Inherited from DeepSeek-V3 Mixture-of-Experts. As previous DeepSeek-series models (DeepSeek-AI, 2024; DeepSeek-AI, 2024), DeepSeek-V4 series also adopt the DeepSeekMoE paradigm (Dai et al., 2024) for FeedForward Networks (FFNs), which sets fine-grained routed experts and shared experts. Different from DeepSeek-V3, we change the activation function that computes the affinity scores from Sigmoid(·) into Sqrt(Softplus(·)). 
For load balancing, we also employ the auxiliary-loss-free strategy (DeepSeek-AI, 2024; Wang et al., 2024a), augmented by a slight sequence-wise balance loss that prevents extreme imbalance within individual sequences. For DeepSeek-V4, we remove the constraint on the number of routing target nodes, and carefully redesign the parallelism strategy to maintain training efficiency. Furthermore, compared with DeepSeek-V3, we replace the dense FFN layers in the initial several Transformer blocks with MoE layers that employ Hash routing (Roller et al., 2021). The Hash routing strategy determines the target experts of each token according to a predefined hash function with regard to the input token ID. Multi-Token Prediction. As DeepSeek-V3, DeepSeek-V4 series also set MTP modules and objectives. Given that the MTP strategy has been validated in DeepSeek-V3, we adopt the same strategy for DeepSeek-V4 series without modification. 2.2. Manifold-Constrained Hyper-Connections As shown in Figure 2, DeepSeek-V4 series incorporate Manifold-Constrained Hyper-Connections (mHC) (Xie et al., 2026) to strengthen the conventional residual connections between adjacent Transformer blocks. Compared with naive Hyper-Connections (HC) (Zhu et al., 2025), the core idea of mHC is to constrain the residual mapping onto a specific manifold, and thus enhance the stability of signal propagation across layers while preserving model expressivity. This subsection briefly introduces the standard HC and describes how we design mHC for stable training. Standard Hyper-Connections. The standard HC expands the width of the residual stream by a factor of 𝑛hc . Specifically, the shape of the residual stream is expanded from R𝑑 to R𝑛hc × 𝑑 , where 𝑑 is the hidden size of the actual layer input. Let 𝑋𝑙 = [x𝑙,1 ; . . . ; x𝑙,𝑛hc ] 𝑇 ∈ R𝑛hc × 𝑑 be the residual state before the 𝑙-th layer. 
HC introduces three linear mappings: an input mapping 𝐴𝑙 ∈ R1× 𝑛hc , a residual transformation 𝐵𝑙 ∈ R𝑛hc × 𝑛hc , and an output mapping 𝐶 𝑙 ∈ R𝑛hc ×1 . The update of the residual state is then formulated as: 𝑋𝑙+1 = 𝐵𝑙 𝑋𝑙 + 𝐶 𝑙 F𝑙 ( 𝐴𝑙 𝑋𝑙 ), (1)
I'm noticing the section headings in this document follow a specific pattern—some appear inline like 2. Architecture, while others like 2.3.1 seem to have the title on the next line, so I should verify which format is being used consistently.
cd /Users/kiran/Downloads/pdf2kindle && sed -n '895,910p' /tmp/dsv4_raw.txt
Comp 𝐶𝑖 = 𝑚 ( 𝑖∑︁ +1) −1 𝑆 𝑎𝑗 ⊙ 𝐶 𝑎𝑗 + 𝑗=𝑚𝑖 𝑚𝑖 −1
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^2\.3\.1" /tmp/dsv4_raw.txt | head -3
202:2.3.1
849:2.3.1. Compressed Sparse Attention
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^[0-9]+\.[0-9]+\.[0-9]+\." /tmp/dsv4_raw.txt | head -25
45:80.680.880.6
61:37.740.039.8
849:2.3.1. Compressed Sparse Attention
1078:2.3.2. Heavily Compressed Attention
1153:2.3.3. Other Details
1195:2.3.4. Efficiency Discussion
1543:3.5.1. Efficient Implementation of Muon
1586:3.5.3. Contextual Parallelism for Long-Context Attention
1604:3.5.4. Extended Automatic Differentiation for Flexible Activation Checkpointing
1633:3.6.1. KV Cache Structure and Management
1774:3.6.2. On-Disk KV Cache Storage
1833:4.2.1. Model Setups
1866:4.2.2. Training Setups
1903:4.2.3. Mitigating Training Instability
1941:4.3.1. Evaluation Benchmarks
2127:4.3.2. Evaluation Results
2154:5.1.1. Specialist Training
2306:5.1.2. On-Policy Distillation
2407:5.2.1. FP4 Quantization Integration
2414:5.2.2. Efficient Teacher Scheduling for Full-Vocabulary OPD
2433:5.2.3. Preemptible and Fault-Tolerant Rollout Service
2453:5.2.4. Scaling RL Framework for Million-Token Context
2463:5.2.5. Sandbox Infrastructure for Agentic AI
2507:5.3.1. Evaluation Setup
3185:5.4.1. Chinese Writing
Looking at the document structure, I notice some h3 headings like 3.5.2 and 5.4.x seem to be missing, though they might just be wrapped across multiple lines due to their length.
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^3\.5\.2|Cost-Effective and Memory-Efficient" /tmp/dsv4_raw.txt | head -5
273:3.5.2
275:Cost-Effective and Memory-Efficient Implementation of mHC . . . . . . .
1573:3.5.2. Cost-Effective and Memory-Efficient Implementation of mHC
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^5\.4\.[0-9]|^5\.3\.2" /tmp/dsv4_raw.txt | head -8
431:5.3.2
441:5.4.1
447:5.4.2
453:5.4.3
459:5.4.4
3185:5.4.1. Chinese Writing
3226:5.4.3. White-Collar Task
cd /Users/kiran/Downloads/pdf2kindle && rg -n "Search$|Code Agent|White-Collar" /tmp/dsv4_raw.txt | head -10 && rg -n "^5\.3\.2|Evaluation Results" /tmp/dsv4_raw.txt | head -10
455:White-Collar Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
461:Code Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3207:5.4.2. Search
3226:5.4.3. White-Collar Task
3397:5.4.4. Code Agent
3985:V4 Agentic Search
3986:V4 Retrieval Augmented Search
4002:Search
355:Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
431:5.3.2
433:Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
569:Summary of Core Evaluation Results
2127:4.3.2. Evaluation Results
2570:5.3.2. Evaluation Results
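I'm identifying the heading patterns in the raw text: chapters use single-level numbering (2. Architecture), sections add a second level (2.1. Designs Inherited from DeepSeek-V3), subsections add a third (2.3.1. Compressed Sparse Attention), and appendices start with a letter instead (A.1. Author List).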
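A rough sketch of how those heading patterns could be turned into Markdown levels. The names `HEADING`, `heading_level`, and `to_markdown` are hypothetical, and so is the idea of passing in a set of titles taken from the paper's TOC. That whitelist is there because the earlier searches showed that numbering alone also matches body text and reference lines such as "128. We employ MoE layers..." and "A. Agache, M. Brooker...".

```python
import re

# A number or appendix letter, optional ".N" sub-levels, a trailing dot, then the title.
HEADING = re.compile(r"^((?:\d+|[A-Z])(?:\.\d+)*)\.\s+(\S.*)$")


def heading_level(line, toc_titles):
    """Return 1, 2, or 3 for a known heading, or None for ordinary text."""
    match = HEADING.match(line.strip())  # strip() also drops a leading form feed
    if not match:
        return None
    number, title = match.groups()
    if title not in toc_titles:
        return None  # numbered like a heading, but not a title from the TOC
    return number.count(".") + 1


def to_markdown(line, toc_titles):
    """Prefix known headings with '#' marks; leave every other line alone."""
    level = heading_level(line, toc_titles)
    if level is None:
        return line
    return "#" * level + " " + line.strip()


toc = {"Architecture", "Muon Optimizer", "Compressed Sparse Attention", "Author List"}
for text in ["2. Architecture", "2.4. Muon Optimizer", "A.1. Author List",
             "A. Agache, M. Brooker, A. Florescu"]:
    print(to_markdown(text, toc))
```

Matching on the stripped line matters because pdftotext separates pages with a form-feed character, which would otherwise sit in front of any heading that starts a page.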
Now I need to figure out where the actual content starts by skipping the table of contents and cover pages, which take up the first several hundred lines.
cd /Users/kiran/Downloads/pdf2kindle && sed -n '485,500p' /tmp/dsv4_raw.txt
55 B Evaluation Details 55 3 1. Introduction The emergence of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024c) has established a new paradigm of test-time scaling, driving substantial performance gains for Large Language Models (LLMs). However, this scaling paradigm is fundamentally constrained by the quadratic computational complexity of the vanilla attention mechanism (Vaswani et al., 2017), which creates a prohibitive bottleneck for ultra-long contexts and reasoning processes. Concurrently, the emergence of long-horizon scenarios and tasks — from complex agentic workflows to
cd /Users/kiran/Downloads/pdf2kindle && sed -n '1,180p' /tmp/dsv4_raw.txtDeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence DeepSeek-AI [REDACTED] Abstract We present a preview version of DeepSeek-V4 series, including two strong Mixture-ofExperts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) ManifoldConstrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-ProMax, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeekV4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4. 
DeepSeek-V4-Pro-Max Claude-Opus-4.6-Max 90.2 89.1 85.9 Accuracy / Pass@1 (%) 80 78.1 75.6 GPT-5.4-xHigh Gemini-3.1-Pro-High 32063168 3052 80.680.880.6 75.1 67.9 65.4 68.5 60 57.9 54.6 51.8 46.2 45.3 37.740.039.8 40 47.2 44.4 48.8 0.8 0.6 0.2 0.0 0 (Pass@1) HLE (Pass@1) Apex Codeforces SWE Shortlist Verified (Rating) (Pass@1) Knowledge & Reasoning (Resolved) Terminal Bench 2.0 (Acc) Toolathlon (Pass@1) Agentic Capabilities 256 512 768 Token Position (K) 1024 DeepSeek-V3.2 DeepSeek-V4-Pro DeepSeek-V4-Flash 40 SimpleQA Verified 3.7× lower 9.8× lower 0.4 50 20 0 DeepSeek-V3.2 DeepSeek-V4-Pro DeepSeek-V4-Flash 1.0 Accumulated KV Cache (GB) 100 Single-Token FLOPs (T) 1.2 30 9.5× smaller 20 13.7× smaller 10 0 0 256 512 768 Sequence Length (K) 1024 Figure 1 | Left: benchmark performance of DeepSeek-V4-Pro-Max and its counterparts. Right: inference FLOPs and KV cache size of DeepSeek-V4 series and DeepSeek-V3.2. Contents 1 Introduction 4 2 Architecture
I'm seeing the document structure now—the body content starts around line 494 with the introduction, and everything before that is front matter like the title, abstract, and table of contents. Looking at how figures are positioned, there's a consistent pattern where figure captions appear between the text and the underlying chart data. The captions themselves carry important information, so I should preserve those while filtering out the raw axis fragments and chart metadata that follow them.
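To make that concrete, here is a small sketch of the two lookups involved: finding where the body begins and recognising real caption lines. `body_start`, `parse_caption`, and the `CAPTION` pattern are hypothetical names, and the sketch assumes the body's first heading appears on its own line exactly as "1. Introduction" (the TOC splits the number and the title across separate lines, so it does not match).

```python
import re

# Captions in the raw text look like "Figure 7 | ..." or "Table 3 | ...".
CAPTION = re.compile(r"^(Figure|Table) (\d+) \| (.+)$")


def body_start(lines, first_heading="1. Introduction"):
    """Index of the first body line; everything before it is front matter."""
    for index, line in enumerate(lines):
        if line.strip() == first_heading:
            return index
    raise ValueError(f"heading not found: {first_heading!r}")


def parse_caption(line):
    """Return (kind, number, text) for a caption line, else None."""
    match = CAPTION.match(line.strip())
    if not match:
        return None
    kind, number, text = match.groups()
    return kind, int(number), text


lines = ["Contents", "1", "Introduction", "4", "1. Introduction", "The emergence of"]
print(body_start(lines))
print(parse_caption("Figure 7 | Thinking management of DeepSeek-V4 series."))
```

Requiring the literal " | " separator keeps in-text mentions such as "Table 13 presents the creative writing comparison" from being treated as captions.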
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^Figure [0-9]+ \||^Table [0-9]+ \|" /tmp/dsv4_raw.txt | head -20
170:Figure 1 | Left: benchmark performance of DeepSeek-V4-Pro-Max and its counterparts. Right:
605:Figure 2 | Overall architecture of DeepSeek-V4 series. We use hybrid CSA (Compressed Sparse
833:Figure 3 | Core architectures of CSA. It compresses the number of KV entries to 𝑚1 times, and
1037:Figure 4 | Core architectures of HCA. It performs heavier compression, where the KV entries of
1340:Figure 5 | Illustration of our EP scheme with related works. Comet (Zhang et al., 2025b) overlaps
1748:Figure 6 | Illustration of the KV cache Layout for DeepSeek-V4. The KV cache is organized into
2227:Table 3 | Instruction injected into the system prompt for the "Think Max" mode.
2295:Figure 7 | Thinking management of DeepSeek-V4 series.
2571:Table 6 | Comparison between DeepSeek-V4-Pro-Max and closed/open source models. "Max",
3017:Figure 8 | Formal reasoning under practical and frontier regimes. Left: Putnam-200 Pass@8
3081:Figure 9 | DeepSeek-V4 series performance on the MRCR task.
3175:Figure 10 | HLE and Terminal Bench 2.0 performance by reasoning effort. “None” indicates
3374:Figure 11 | Win-rate comparison across analysis, generation, editing tasks, and the overall
3390:Figure 12 | Detailed dimension scores including Task Completion, Content Quality, Formatting Aesthetics, and Instruction Following.
3392:Figure 13 | Example output of a task which requires drafting a joint marketing proposal for a
3405:Table 8 | Comparison on R&D Coding Benchmark (external models included strictly for evaluation purposes).
3897:Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro.
3966:Table 10 | Cost Comparison:Agentic Search vs. Retrieval Augmented Search (Mean) for
3988:Table 11 | Comparative Evaluation of DeepSeek-V4-Pro and DeepSeek-V3.2 on Search Q&A
4143:Figure 15 | Example output of a task which requires researching 2020-2025 Nobel Science Prizes
cd /Users/kiran/Downloads/pdf2kindle && rg -n "Table [0-9]+ \||^Table [0-9]+\b" /tmp/dsv4_raw.txt | head -20
1958:Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4Pro-Base. All models are evaluated in our internal framework and share the same evaluation
2184:Table 2 | Comparison of three reasoning modes
2227:Table 3 | Instruction injected into the system prompt for the "Think Max" mode.
2249:Table 4 | Tool-call schema for DeepSeek-V4 series.
2315:Table 5 | Quick Instruction special tokens for auxiliary tasks.
2571:Table 6 | Comparison between DeepSeek-V4-Pro-Max and closed/open source models. "Max",
2769:Table 7 | Comparison among different sizes and modes of DeepSeek-V4 series. "Non-Think",
3195:Table 13 presents the creative writing comparison, which is evaluated along two axes:
3212:Table 11, DeepSeek-V4-Pro outperforms DeepSeek-V3.2 by a substantial margin, demonstrating
3405:Table 8 | Comparison on R&D Coding Benchmark (external models included strictly for evaluation purposes).
3897:Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro.
3966:Table 10 | Cost Comparison:Agentic Search vs. Retrieval Augmented Search (Mean) for
3988:Table 11 | Comparative Evaluation of DeepSeek-V4-Pro and DeepSeek-V3.2 on Search Q&A
4148:Table 12 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Functional
4600:Table 13 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Creative
4865:Table 14 | DeepSeek-V4-Pro vs. Claude-Opus-4.5 on Complex Instruction Following and MultiTurn Writing.
cd /Users/kiran/Downloads/pdf2kindle && sed -n '2225,2310p' /tmp/dsv4_raw.txt
summary
Table 3 | Instruction injected into the system prompt for the "Think Max" mode.
Injected Instruction
Reasoning Effort: Absolute maximum with no shortcuts permitted.
You MUST be very thorough in your thinking and comprehensively decompose the
problem to resolve the root cause, rigorously stress-testing your logic against all potential
paths, edge cases, and adversarial scenarios.
Explicitly write out your entire deliberation process, documenting every intermediate
step, considered alternative, and rejected hypothesis to ensure absolutely no assumption
is left unchecked.
model leverages its own logic to generalize across complex tasks.
Tool-Call Schema and Special Token. Consistent with our previous version, we utilize a
dedicated <think></think> tag to delineate the reasoning path. In DeepSeek-V4 series, we
introduce a new tool-call schema that employs a special "|DSML|" token and utilizes an XMLbased format for tool invocations, as demonstrated in Table 4. Our experiments demonstrate that
the XML format effectively mitigates escaping failures and reduces tool-call errors, providing a
more robust interface for model-tool interactions.
Interleaved Thinking. DeepSeek-V3.2 introduced a context management strategy that retains
reasoning traces across tool-result rounds but discards them upon the arrival of new user messages. While effective, this still caused unnecessary token waste in complex agentic workflows
— each new user turn would flush all accumulated reasoning content, forcing the model to
reconstruct its problem-solving state from scratch. Leveraging the expanded 1M-token context
30
Table 4 | Tool-call schema for DeepSeek-V4 series.
Tool Call Schema
## Tools
You have access to a set of tools to help answer the user’s question. You can
invoke tools by writing a "<|DSML|tool_calls>" block like the following:
<|DSML|tool_calls>
<|DSML|invoke name="$TOOL_NAME">
<|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE
</|DSML|parameter>
...
</|DSML|invoke>
<|DSML|invoke name="$TOOL_NAME2">
...
</|DSML|invoke>
</|DSML|tool_calls>
String parameters should be specified as is and set ‘string="true"‘. For all
other types (numbers, booleans, arrays, objects), pass the value in JSON
format and set ‘string="false"‘.
If thinking_mode is enabled (triggered by <think>), you MUST output your
complete reasoning inside <think>...</think> BEFORE any tool calls or
final response.
Otherwise, output directly after </think> with tool calls or final response.
### Available Tool Schemas
{Tool Definition...}
You MUST strictly follow the above definedtool name and parameter schemas to
invoke tool calls.
window of DeepSeek-V4 series, we further refine this mechanism to maximize the effectiveness
of interleaved thinking in agentic environments:
• Tool-Calling Scenarios. As illustrated in Figure 7(a), all reasoning content is fully preserved throughout the entire conversation. Unlike DeepSeek-V3.2, which discarded
thinking traces upon each new user turn, DeepSeek-V4 series retain the complete reasoning history across all rounds, including across user message boundaries. This allows the
model to maintain a coherent, cumulative chain of thought over long-horizon agent tasks.
• General Conversational Scenarios. As illustrated in Figure 7(b), the original strategy is
preserved: reasoning content from previous turns is discarded when a new user message
arrives, keeping the context concise for settings where persistent reasoning traces provide
limited benefit.
As with DeepSeek-V3.2, agent frameworks that simulate tool interactions via user messages (e.g.,
Terminus) may not trigger the tool-calling context path and thus may not benefit from enhanced
reasoning persistence. We continue to recommend non-think models for such architectures.
31
a) Thinking with tools
b) Thinking without tools
Figure 7 | Thinking management of DeepSeek-V4 series.
Quick Instruction. In chatbot scenarios, a number of auxiliary tasks (e.g., determining whether
to trigger a web search, intent recognition, etc.) must be executed before generating the response.
Conventionally, these tasks are handled by a separate small model, requiring redundant prefilling since it cannot reuse the existing KV cache. To overcome this limitation, we introduce Quick
Instruction. We append a set of dedicated special tokens directly to the input sequence, where
each token corresponds to a specific auxiliary task. By directly reusing the already-computed
KV cache, this mechanism completely avoids redundant prefilling and allows certain tasks, such
as generating search queries and determining authority and domain, to be executed in parallel.
Consequently, this approach significantly reduces the user-perceived time-to-first-token (TTFT)
and eliminates the engineering overhead of maintaining and iterating an extra small model. The
supported Quick Instruction tokens are summarized in Table 5.
5.1.2. On-Policy Distillation
After training multiple domain-specific experts via specialized fine-tuning and reinforcement
learning, we employ multi-teacher On-Policy Distillation (OPD) as the primary technique for
merging expert capabilities into the final model. OPD has emerged as an effective post-training
paradigm for efficiently transferring the knowledge and capabilities of domain experts to a
I'm realizing those markup tags are literal text examples in the paper that need to be escaped in a code block so pandoc doesn't interpret them. I'm going to write a custom build script that reads the raw PDF text output and processes it carefully, starting by skipping past the title and abstract sections.
The script will walk through the document detecting heading patterns and converting them to markdown, joining wrapped lines within paragraphs, removing page numbers, and identifying code-block content like the instruction text and tool-call schemas to wrap in fenced blocks. I'll also need to escape any raw <think> tags or <|DSML|...> tokens that appear in regular prose by wrapping them in backticks. For the references section at the end, I should check whether it's worth preserving as-is since it's a standard bibliography.
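The backtick escaping for prose could look roughly like this. It is a sketch only: `LITERAL_TAGS` and `escape_literal_tags` are hypothetical names, the pattern covers just the two tag families seen in the dump above (think tags and DSML tags), and it does not check whether a tag is already inside a code span.

```python
import re

# One or more adjacent literal tags: <think>, </think>, <|DSML|...>, </|DSML|...>.
# Adjacent tags are matched as a single run so they share one pair of backticks.
LITERAL_TAGS = re.compile(r"(?:</?(?:think|\|DSML\|[^>]*)>)+")


def escape_literal_tags(text):
    """Wrap literal tag runs in backticks so pandoc treats them as inline code."""
    return LITERAL_TAGS.sub(lambda match: f"`{match.group(0)}`", text)


print(escape_literal_tags("we utilize a dedicated <think></think> tag"))
print(escape_literal_tags('writing a "<|DSML|tool_calls>" block'))
```

Matching a run of tags rather than each tag separately avoids emitting two code spans back to back, where the neighbouring backticks would merge into a double-backtick delimiter and change how the Markdown parses.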
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^References" /tmp/dsv4_raw.txt
3475:References
cd /Users/kiran/Downloads/pdf2kindle && sed -n '3470,3530p' /tmp/dsv4_raw.txtwill continue to iterate and explore in this direction. We are also working on incorporating multimodal capabilities to our models. Finally, we are committed to developing better data curation and synthesis strategies to consistently enhance model intelligence, robustness, and practical usability across an increasingly broad range of scenarios and tasks. References AA. Gdpval-aa leaderboard, 2025. URL https://artificialanalysis.ai/methodolog y/intelligence-benchmarking#gdpval-aa. T. Achim, A. Best, A. Bietti, K. Der, M. Fédérico, S. Gukov, D. Halpern-Leistner, K. Henningsgard, Y. Kudryashov, A. Meiburg, et al. Aristotle: Imo-level automated theorem proving. arXiv preprint arXiv:2510.01346, 2025. A. Agache, M. Brooker, A. Florescu, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and D.-M. Popa. Firecracker: lightweight virtualization for serverless applications. In Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, NSDI’20, page 419–434, USA, 2020. USENIX Association. ISBN 9781939133137. O. J. Aimuyo, B. Oh, and R. Singh. Flashmoe: Fast distributed moe in a single kernel. Advances in Neural Information Processing Systems, 2025. URL https://neurips.cc/virtual/2 025/poster/119124. J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023. J. Asher. LeanExplore: A search engine for Lean 4 declarations, 2025. URL https://arxiv.or g/abs/2506.11085. Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, Z. Chen, J. Cui, H. Ding, M. Dong, A. Du, C. Du, D. Du, Y. Du, Y. Fan, Y. Feng, K. Fu, B. Gao, H. Gao, P. Gao, T. Gao, X. Gu, L. Guan, H. Guo, J. Guo, H. Hu, X. Hao, T. He, W. He, W. He, C. Hong, Y. Hu, Z. Hu, W. Huang, Z. Huang, Z. Huang, T. Jiang, Z. Jiang, X. Jin, Y. Kang, G. Lai, C. 
Li, F. Li, H. Li, M. Li, W. Li, Y. Li, Y. Li, Z. Li, Z. Li, H. Lin, X. Lin, Z. Lin, C. Liu, C. Liu, H. Liu, J. Liu, J. Liu, L. Liu, S. Liu, T. Y. Liu, T. Liu, W. Liu, Y. Liu, Y. Liu, Y. Liu, Y. Liu, Z. Liu, E. Lu, L. Lu, S. Ma, X. Ma, Y. Ma, S. Mao, J. Mei, X. Men, Y. Miao, S. Pan, Y. Peng, R. Qin, B. Qu, Z. Shang, L. Shi, S. Shi, F. Song, J. Su, Z. Su, X. Sun, F. Sung, H. Tang, J. Tao, Q. Teng, C. Wang, D. Wang, F. Wang, and H. Wang. Kimi K2: open agentic intelligence. CoRR, abs/2507.20534, 2025a. URL https://doi.org/10.48550/arXiv.2507.20534. Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, et al. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639–3664, 2025b. 45 M. Balunović, J. Dekoninck, I. Petrov, N. Jovanović, and M. Vechev. Matharena: Evaluating llms on uncontaminated math competitions. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmark, 2025. C. Bandi, B. Hertzberg, G. Boo, T. Polakam, J. Da, S. Hassaan, M. Sharma, A. Park, E. Hernandez, D. Rambado, et al. Mcp-atlas: A large-scale benchmark for tool-use competency with real mcp servers. arXiv preprint arXiv:2602.00933, 2026. F. Bellard. Qemu, a fast and portable dynamic translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’05, page 41, USA, 2005. USENIX Association. I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning, 2017. URL https://openreview.net/forum?id=rJY3vK9eg. J. Chen, W. Chen, J. Du, J. Hu, Z. Jiang, A. Jie, X. Jin, X. Jin, C. Li, W. Shi, Z. Wang, M. Wang, C. Wei, S. Wei, H. Xin, F. Yang, W. Gao, Z. Yuan, T. Zhan, Z. Zheng, T. Zhou, and T. H. Zhu. Seed-prover 1.5: Mastering undergraduate-level theorem proving via learning from experience, 2025. 
URL https://arxiv.org/abs/2512.17260. M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage,
The references have some wrapped URLs that need fixing, and while the list is lengthy, it's valuable enough to keep. The author list is also quite long, so I'm weighing whether to move the references to an appendix for better Kindle readability or strip them entirely since they're not essential for that format.
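One possible way to repair the wrapped URLs, sketched under a loud assumption: in this reference list a complete URL ends with a period, so a line whose last token is a URL without a trailing period is taken to continue on the next line. The function name `join_wrapped_urls` is hypothetical, and the heuristic would misfire on any reference whose URL genuinely ends without a period.

```python
def join_wrapped_urls(lines):
    """Glue a line onto the previous one when that line ends in an unfinished URL."""
    joined = []
    for line in lines:
        if joined:
            tokens = joined[-1].split()
            last = tokens[-1] if tokens else ""
            # Assumption: a finished URL in this bibliography ends with ".".
            if last.startswith(("http://", "https://")) and not last.endswith("."):
                joined[-1] = joined[-1] + line.strip()  # no space: same URL
                continue
        joined.append(line.rstrip())
    return joined


refs = [
    "J. Asher. LeanExplore: A search engine for Lean 4 declarations, 2025. URL https://arxiv.or",
    "g/abs/2506.11085.",
    "Y. Bai, S. Tu, J. Zhang, et al. Longbench v2.",
]
for ref in join_wrapped_urls(refs):
    print(ref)
```

This has to run on the raw lines before paragraphs are reflowed, since after reflow the break shows up as a space inside the URL and can no longer be told apart from a real word boundary.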
cd /Users/kiran/Downloads/pdf2kindle && sed -n '3838,3900p' /tmp/dsv4_raw.txt53 Appendix A. Author List and Acknowledgment A.1. Author List Authors are listed alphabetically by their first name. Names marked with * denote individuals who have departed from our team. Research & Engineering: Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang*, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chenze Shao, Chong Ruan*, Conner Sun, Damai Dai, Daya Guo*, Dejian Yang, Deli Chen, Donghao Li, Erhang Li, Fangyun Lin, Fangzhou Yuan, Feiyu Xia, Fucong Dai, Guangbo Hao, Guanting Chen, Guoai Cao, Guolai Meng, Guowei Li, Han Yu, Han Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoling Zhang, Haoming Luo, Haoran Wei*, Haotian Yuan, Haowei Zhang*, Haowen Luo, Haoyu Chen, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, J. Yang, J.Q. Zhu, Jia Yu, Jialiang Huang, Jiasheng Ye, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jin Yan, Jingchang Chen, Jingli Zhou, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jiping Yu, Joseph Sun, Jun Ran*, Junguang Jiang, Junjie Qiu, Junlong Li*, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexing Zhou, Kezhao Huang*, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Li Zhang, Liang Zhao, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyu Cai, Liyue Zhang, Longhao Chen, M.S. Di, M.Y Xu, Max Mei, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Qiwei Jiang, Rui Tian, Ruifan Xu, Ruijie Lu, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqian Chen, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S.H. 
Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoheng Nie, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Shuying Yu, Songyang Zhou, Tao Ni, Tao Yun, Tian Jin, Tian Pei, Tian Ye, Tianle Lin, Tianran Ji, Tianyi Cui, Tianyuan Yue, Tingting Yu, Tun Wang, W. Zhang, Wangding Zeng, Weilin Zhao, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjing Yao, Wenjun Gao, Wenkai Yang, Wenlve Huang, Wentao Zhang, Wenting Ma, Xi Gao, Xiang He, Xiangwen Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingchen Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xu Chen, Xuanyu Wang, Xuecheng Su, Xuheng Lin, Xuwei Fu, Y.C. Yan, Y.Q. Wang*, Y.W. Ma, Yanfeng Luo, Yang Zhang, Yanhong Xu, Yanru Ma, Yanwen Huang, Yao Li, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yijia Wu, Yiliang Xiong, Ying He, Ying Zhou, Yingjia Luo, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiang Zhang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuanhao Li, Yuduan Wang, Yuhan Wu, Yuhao Meng, Yuheng Zou, YuKun Li, Yunfan Xiong, Yupeng Chen, Yuqian Cao, Yuqian Wang, Yushun Zhang, Yutong Lin, Yuxian Gu, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuxuan Zhou, Yuyang Zhou, Yuzhen Huang, Z.F. Wu, Zehao Wang, Zehua Zhao, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhixuan Chen, Zhiyu Wu, Zhizhou Ren, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang*, Ziwei Xie, Ziyi Gao, Zizheng Pan, Zongqing Yao. 
Business & Compliance: Chenchen Ling, Chengyu Hou, Dongjie Ji, Fang Wei, Hengqing Zhang, Jia Luo, Jia Song, Jialu Cai, Jian Liang, Jiangting Zhou, Jieyu Yang, Jin Chen, Jingzi Zhou, Junmin Zheng, Leyi Xia, Linyan Zhu, Miaojun Wang, Mingming Li, Minmin Han, Ning Wang, Panpan 54 Wang, Peng Zhang, Ruyi Chen, Shangmian Sun, Shaoqing Wu, W.L. Xiao, Wei An, Wenqing Hou, Xianzu Wang, Xiaowen Sun, Xiaoxiang Wang, Xinyu Zhang, Xueyin Chen, Yao Xu, Yi Shao, Yiling Ma, Ying Tang, Yuehan Yang, Yuer Xu, Yukun Zha, Yuping Lin, Yuting Yan, Zekai Zhang, Zhe Ju, Zheren Gao, Zhongyu Wu, Zihua Qu, Ziyi Wan. A.2. Acknowledgment We would like to thank Dolly Deng and other testers for their valuable suggestions and feedback regarding the capabilities of DeepSeek-V4 series models. B. Evaluation Details Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro. Difficulty Category #
Now I'm mapping out the build script structure—I'll read the raw PDF text output, skip the initial boilerplate up to the introduction, then process each line to rejoin wrapped paragraphs and detect heading patterns so I can emit proper markdown formatting while filtering out extraneous content.
I need to handle several special cases: drop standalone page numbers, preserve figure and table captions as italicized text, strip orphan label-like lines conservatively, and wrap Tables 3 and 4 in code fences to protect their XML-like structure. For the <think> tags and similar constructs scattered through the prose, I'll escape them by wrapping in backticks to prevent XML parsing issues.
References can stay mostly as-is but I'll need to rejoin any URLs that got split across lines, and the appendix author list is fine to keep. The Unicode mathematical characters like 𝑍 and 𝐻 ·𝑊 should render fine without special handling—Pandoc won't misinterpret them as TeX math.
The real issue is stray $ characters triggering math mode and causing Pandoc to choke on sequences like $...$ that contain XML-like syntax. I need to search for and escape those dollar signs.
cd /Users/kiran/Downloads/pdf2kindle && rg -c '\$' /tmp/dsv4_raw.txt
3
cd /Users/kiran/Downloads/pdf2kindle && rg -n '\$' /tmp/dsv4_raw.txt
2256:<|DSML|invoke name="$TOOL_NAME">
2257:<|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE
2261:<|DSML|invoke name="$TOOL_NAME2">
I see that Table 4 has dollar signs in the schema that need to be wrapped in a code fence to avoid parsing issues. Table 3 looks fine with just plain instruction text. Now I'm moving on to draft the build script that will process the raw PDF text output, walking through the lines and emitting the formatted content.
I'm setting up heading detection patterns that anchor to the start of each line—single-level headings like "Introduction" and "Architecture," multi-level numbered sections, lettered appendices, and special sections like References. Page numbers that appear as standalone integers get dropped, while figure and table captions get formatted as italicized blocks.
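A minimal sketch of the line classification described here, assuming raw `pdftotext` lines shaped like `4.3.2. Evaluation Results`, `53`, and `Table 9 | ...`. The function name `classify` and all three regexes are illustrative assumptions, not the contents of the actual build script; the heading levels assume top-level sections map to `##`.

```python
import re

# Hypothetical patterns for illustration; the real script's regexes may differ.
NUMBERED = re.compile(r"^(\d+(?:\.\d+){0,2})\.\s+(\S.*)$")  # "2.3.1. Title"
PAGE_NUMBER = re.compile(r"^\d{1,3}$")                      # standalone "53"
CAPTION = re.compile(r"^(?:Figure|Table)\s+\d+\s+\|\s")     # "Table 9 | ..."


def classify(line: str) -> tuple[str, str]:
    """Return (kind, markdown) for one raw line."""
    text = line.strip()
    if PAGE_NUMBER.match(text):
        return ("skip", "")
    if CAPTION.match(text):
        return ("caption", f"*{text}*")
    if text == "References":
        return ("heading", "## References")
    m = NUMBERED.match(text)
    if m:
        number, title = m.groups()
        depth = number.count(".")
        # Top-level sections keep their trailing dot ("1."), deeper ones drop it.
        label = f"{number}." if depth == 0 else number
        return ("heading", f"{'#' * (depth + 2)} {label} {title}")
    return ("prose", text)
```

Anchoring to the start of the line is not sufficient on its own: any line shaped like `2. something`, including a numbered item inside a table, would still be classified as a heading by this sketch.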
For the problematic tables, I'm identifying them by their content boundaries—Table 3 ends before "model leverages" and Table 4 ends before "window of DeepSeek-V4"—and wrapping those ranges in code fences. Actually, I'm realizing the core issue is just escaping the special tokens like <think> and <|DSML|...> that appear in the schema, so I'll handle that with a global replacement approach instead.
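The global replacement could look roughly like this. It is a sketch under the assumption that the only problematic tokens are `<think>`, `</think>`, and the `<|DSML|...>` family seen in the raw text; `escape_tokens` and `TOKEN` are hypothetical names.

```python
import re

# Matches <think>, </think>, <|DSML|...> and </|DSML|...> (assumed token shapes).
TOKEN = re.compile(r"</?(?:think|\|DSML\|[^<>\n]*)>")


def escape_tokens(text: str) -> str:
    """Wrap each special token in backticks so it becomes an inline code span."""
    return TOKEN.sub(lambda m: f"`{m.group(0)}`", text)
```

Because the whole token ends up inside a code span, any `$` within it (as in `name="$TOOL_NAME"`) is no longer a math-mode candidate. The function is not idempotent, so it should run exactly once per paragraph.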
Building the script now with a simpler architecture: read the raw text, skip everything before "1. Introduction," then process paragraph by paragraph. For each block, check if it's a heading pattern, a page number to skip, a figure or table caption, or regular prose that needs dehyphenation and HTML escaping.
Tables 3 and 4 get special handling by detecting their starting patterns and wrapping them in code fences. For URLs split across lines, I need to detect when a line contains http and the next line continues without a space, then join them without adding extra whitespace to preserve the URL structure.
For soft hyphens introduced by PDF wrapping, I'll use a simple heuristic: if a line ends with - and the next line starts with a lowercase letter, I'll remove the hyphen and join directly, assuming it's a word break rather than a genuine compound word.
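A sketch of that dehyphenation rule, assuming each paragraph arrives as a list of physical lines. `join_wrapped` is a hypothetical name, and the handling of a capitalized continuation (keep the hyphen, add no space) is my own assumption rather than something the heuristic above specifies.

```python
def join_wrapped(lines: list[str]) -> str:
    """Join the physical lines of one paragraph into a single string."""
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not out:
            out = line
        elif out.endswith("-") and line[0].islower():
            # Treat as a PDF word break: drop the hyphen and join directly.
            out = out[:-1] + line
        elif out.endswith("-"):
            # Assumption: a capitalized or numeric continuation is a real
            # compound (e.g. a model name), so keep the hyphen, add no space.
            out += line
        else:
            out += " " + line
    return out
```

The lowercase rule has a known cost: a genuine compound that happens to wrap, such as `long-` followed by `context`, is collapsed to `longcontext`.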
For URLs split across lines, I need to be more careful. Rather than trying to detect every possible URL pattern upfront, I'll write a targeted regex that handles the specific broken patterns I see in the references, then do a broader cleanup pass to catch any remaining splits where a line ends with / or contains http and the next line continues with what looks like a URL fragment.
cd /Users/kiran/Downloads/pdf2kindle && rg -B0 -A1 -n "https?://" /tmp/dsv4_raw.txt | head -40
18:are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.
19-
--
856:1 https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/tree/main/inference
857-
--
1397:2 https://github.com/deepseek-ai/DeepGEMM/pull/304
1398-
--
1479:3 https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplicat
1480-ion/index.html#wave-quant
1481:4 https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/writing-cuda-kernels
1482-.html#distributed-shared-memory
--
3476:AA. Gdpval-aa leaderboard, 2025. URL https://artificialanalysis.ai/methodolog
3477-y/intelligence-benchmarking#gdpval-aa.
--
3486:in Neural Information Processing Systems, 2025. URL https://neurips.cc/virtual/2
3487-025/poster/119124.
--
3491:J. Asher. LeanExplore: A search engine for Lean 4 declarations, 2025. URL https://arxiv.or
3492-g/abs/2506.11085.
--
3502:URL https://doi.org/10.48550/arXiv.2507.20534.
3503-Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, et al. Longbench
--
3520:reinforcement learning, 2017. URL https://openreview.net/forum?id=rJY3vK9eg.
3521-J. Chen, W. Chen, J. Du, J. Hu, Z. Jiang, A. Jie, X. Jin, X. Jin, C. Li, W. Shi, Z. Wang, M. Wang,
--
3524:experience, 2025. URL https://arxiv.org/abs/2512.17260.
3525-M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
--
3533:URL https://arxiv.org/abs/2107.03374.
3534-T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze,
--
3537:ISBN 978-1-939133-08-3. URL https://www.usenix.org/conference/osdi18/prese
3538-ntation/chen.
--
3545:07372. URL https://doi.org/10.48550/arXiv.2601.07372.
3546-K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek,
I'm noticing a pattern where URLs get split across line breaks without any special formatting—like https://artificialanalysis.ai/methodolog continuing as y/intelligence-benchmarking#gdpval-aa. on the next line. To catch this, I need to check if a line contains a URL scheme and ends with a non-space character while the next line starts with one, then join them directly.
The regex needs to be smarter about when to stitch URLs across line breaks: it has to distinguish a genuinely broken URL from a complete URL that is simply followed by new prose. If the next line starts with a capital letter, it's probably a new sentence or reference, not a URL continuation, so I should only stitch when the next fragment looks like part of a URL (lowercase start, URL-safe characters). There is a second problem with my current approach: because I join lines with a space before matching, the regex can capture the start of the next paragraph as part of the URL. I need a better heuristic for detecting when a URL is actually split across lines.
Looking at the actual example, the URL https://artificialanalysis.ai/methodolog ends mid-word on one line, and the next line continues with y/intelligence-benchmarking#gdpval-aa. — so I should join without a space when the previous line's URL ends without a clear terminator and the next line starts with URL-class characters like slashes or dots.
The simplest approach: when the last word of a line contains http and the next line begins with lowercase letters or URL punctuation like /, ?, #, or ., treat it as a URL continuation and join without spacing.
cd /Users/kiran/Downloads/pdf2kindle && awk 'NR==FNR && /https?:\/\// {url[NR]=1; next} url[FNR-1]' /tmp/dsv4_raw.txt /tmp/dsv4_raw.txt | head -30ion/index.html#wave-quant .html#distributed-shared-memory y/intelligence-benchmarking#gdpval-aa. 025/poster/119124. g/abs/2506.11085. Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, et al. Longbench J. Chen, W. Chen, J. Du, J. Hu, Z. Jiang, A. Jie, X. Jin, X. Jin, C. Li, W. Shi, Z. Wang, M. Wang, M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, ntation/chen. K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, 46 L. De Moura and N. Bjørner. Z3: an efficient smt solver. In Proceedings of the Theory 931. org/10.48550/arXiv.2412.19437. 434. DeepSeek-AI. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nat., DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models, 2025. URL X. Deng, J. Da, E. Pan, Y. Y. He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, K. Sampath, H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, and S. Soatto. Fewer truncations X. Du, Y. Yao, K. Ma, B. Wang, T. Zheng, K. Zhu, M. Liu, Y. Liang, X. Jin, Z. Wei, et al. Supergpqa: X. Gao, M. Dong, X. Miao, W. Du, C. Yu, and H. Chen. Erofs: a compression-friendly readonly g/10.48550/arXiv.2406.04127. L. Haas, G. Yona, G. D’Antonio, S. Goldshtein, and D. Das. Simpleqa verified: A reliable B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. H. Li, Y. Yuan, R. Du, K. Ma, L. Liu, and W. Hsu. DADI: Block-Level image service for agile 48
I'm parsing out the URL fragments to identify where each link points: there's NVIDIA's matrix-multiplication performance doc, the CUDA programming guide section on writing kernels, and the Artificial Analysis intelligence-benchmarking methodology page.
The pattern I'm noticing is that when a URL continues on the next line, the continuation starts with a lowercase letter, a digit, or URL punctuation such as "." rather than the capital letter a new reference or sentence would start with. So my heuristic is straightforward: if the last token of the previous line is an http:// or https:// URL and the next line begins with a lowercase letter, a digit, or URL punctuation, I should join them without a space.
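The heuristic can be sketched as below. `stitch_urls` and both regexes are hypothetical names, and the extra guard (do not stitch when the URL already ends in sentence punctuation) is an assumption I'm adding so that a complete URL followed by lowercase prose is left alone.

```python
import re

# Last token on the line is a URL that does not end in closing punctuation.
URL_TAIL = re.compile(r"https?://\S*[^\s.,;)]$")
# Next line starts like a URL fragment: lowercase, digit, or URL punctuation.
CONTINUATION = re.compile(r"^[a-z0-9/?#.]")


def stitch_urls(lines: list[str]) -> list[str]:
    """Rejoin URLs that were split across consecutive lines, with no space."""
    out: list[str] = []
    for line in lines:
        if out and URL_TAIL.search(out[-1]) and CONTINUATION.match(line):
            out[-1] += line
        else:
            out.append(line)
    return out
```

A URL broken inside the scheme itself (`http` on one line, `s://...` on the next) is not caught by this check, since the first half does not yet contain `://`.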
Now I'm ready to write the build script that applies this logic.
Now I have enough to design a robust custom builder. Let me create a dedicated script for this paper.
cd /Users/kiran/Downloads/pdf2kindle && rg -n "Total \(总计\)|Subtotal" /tmp/dsv4_raw.txt | head -10
3950:Total (总计)
4008:Subtotal (小计)
4135:Subtotal (小计)
4241:Subtotal (小计)
4311:Subtotal (小计)
4380:Subtotal (小计)
4441:Subtotal (小计)
4500:Subtotal (小计)
4557:Subtotal (小计)
4596:Total (总计)
cd /Users/kiran/Downloads/pdf2kindle && sed -n '3895,4020p' /tmp/dsv4_raw.txtB. Evaluation Details Table 9 | Agentic Search vs. Retrieval Augmented Search for DeepSeek-V4-Pro. Difficulty Category # Agent Win RAG Win Tie Agent% RAG% Tie% Easy Objective Q&A (客观问答) 196 Subjective Q&A (主观问答) 321 110 198 43 56 43 67 56.1 61.7 21.9 17.4 21.9 20.9 Hard Objective Q&A (客观问答) 168 Subjective Q&A (主观问答) 184 102 126 33 27 33 31 60.7 68.5 19.6 14.7 19.6 16.8 Total (总计) 536 159 174 61.7 18.3 20.0 869 Table 10 | Cost Comparison:Agentic Search vs. Retrieval Augmented Search (Mean) for DeepSeek-V4-Pro. Most of the tool calls are parallel for Agentic Search. Version Tool Calls Prefill (tokens) Output (tokens) 16.2 — 13649 10453 1526 1308 V4 Agentic Search V4 Retrieval Augmented Search Table 11 | Comparative Evaluation of DeepSeek-V4-Pro and DeepSeek-V3.2 on Search Q&A Tasks. Internal Evaluation (内部综合评估) Category Subcategory # V4 win V3.2 win tie V4% V3.2% tie% 95 Objective Single-value Search (单值信息查找) Entity Search (实体信息查找) 99 Q&A Enumerative Search (枚举型信息查找) 95 (客观问答) Subtotal (小计) 289 36 24 19 10 7 8 49 68
I'm noticing the Evaluation Details appendix has multi-column tables with Chinese subtitles that won't format properly in markdown, so I need to check how extensive this section is.
cd /Users/kiran/Downloads/pdf2kindle && sed -n '4595,4905p' /tmp/dsv4_raw.txt | head -100 && echo "---" && wc -l /tmp/dsv4_raw.txt
Total (总计)
57
Table 13 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Creative
Writing.
Subcategory (文体)
#
Instruction Following(指令遵循)
Writing Quality (写作质量)
DS Gem Tie DS% Gem% Tie%
DS Gem Tie DS% Gem% Tie%
504
368
253
111
115
83
74
81
24
23
8
20
15
11
5
2
5
1
Fiction (小说故事)
General Fiction (泛小说故事)
Fan Fiction (同人文)
General Fan Fic. (泛同人文)
Narrative (记叙文)
General Prose (泛散文)
Prose (散文)
Writing Style (文笔)
Classical Poetry (古诗文)
Modern Poetry (现代诗)
Lyrics (歌词)
Literary Appreciation (赏析)
General Argument. (泛议论文)
General Narrative (泛记叙文)
General Classical (泛古文诗歌)
Creative Writing (创意写作)
Argumentative (议论文)
General Mod. Poetry (泛现代诗)
836
662
410
202
171
124
112
112
48
43
30
27
24
23
9
6
5
2
323
290
150
90
54
40
38
31
24
20
22
7
9
12
4
4
0
1
Total (总计)
2837 1703 1119
---
4905 /tmp/dsv4_raw.txt
I'm deciding to keep the appendix section headers and captions but preserve the dense multi-column tables as preformatted blocks rather than trying to reflow them into markdown, since they'd be hard to read on a Kindle anyway.
Now I'm moving on to write the build script.
cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout -f 56 -l 58 /Users/kiran/Downloads/DeepSeek_V4.pdf - | head -120
Figure 14 | Example output of a task that requires comparing two regular investment strategies
for the NASDAQ.
Figure 15 | Example output of a task which requires researching 2020-2025 Nobel Science Prizes
and generating an analytical PDF report.
56
Table 12 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Functional
Writing.
Internal Evaluation (内部综合评估)
Category Subcategory # DS win Gem win Tie DS% Gem% Tie%
Report (报告) 527 350 162 15 66.41 30.74 2.85
Proposal (方案策划) 291 181 103 7 62.20 35.40 2.41
Education (教育培训) 159 100 56 3 62.89 35.22 1.89
Business Email & Letter (邮件书信) 146 107 37 2 73.29 25.34 1.37
Writing Notice (通知公告) 72 43 24 5 59.72 33.33 6.94
(办公文本) Professional (专业文本) 63 34 27 2 53.97 42.86 3.17
Recruitment (招聘求职) 42 27 15 0 64.29 35.71 0.00
Technical (技术文本) 29 22 7 0 75.86 24.14 0.00
Review (介绍评价) 20 15 5 0 75.00 25.00 0.00
Subtotal (小计) 1349 879 436 34 65.16 32.32 2.52
Social Media (社交媒体文案) 267 156 101 10 58.43 37.83 3.75
Ad Copy (广告商品文案) 214 109 98 7 50.93 45.79 3.27
Long-form Content (内容平台长文) 99 71 25 3 71.72 25.25 3.03
Media News Report (新闻报道) 51 27 22 2 52.94 43.14 3.92
Writing Advertorial (营销软文) 17 12 4 1 70.59 23.53 5.88
(媒体文本) Headline (标题) 11 7 4 0 63.64 36.36 0.00
Narration Script (口播文案) 4 2 1 1 50.00 25.00 25.00
Comment (评论) 3 2 1 0 66.67 33.33 0.00
Subtotal (小计) 666 386 256 24 57.96 38.44 3.60
Congratulatory (祝贺文本) 101 54 41 6 53.47 40.59 5.94
Everyday Communication (沟通回复) 100 71 26 3 71.00 26.00 3.00
Writing Reflection (心得感想) 90 68 17 5 75.56 18.89 5.56
Review (介绍评价) 55 44 9 2 80.00 16.36 3.64
(生活文本)
Comment (评论) 44 34 8 2 77.27 18.18 4.55
Subtotal (小计) 390 271 101 18 69.49 25.90 4.62
Speech (发言稿) 226 135 85 6 59.73 37.61 2.65
Narration Script (口播文案) 51 25 23 3 49.02 45.10 5.88
Oral
Writing Sales Script (话术) 31 22 6 3 70.97 19.35 9.68
Dialogue (对话文本) 10 4 6 0 40.00 60.00 0.00
(口头文本)
Congratulatory (祝贺文本) 1 1 0 0 100.00 0.00 0.00
Subtotal (小计) 319 187 120 12 58.62 37.62 3.76
Administrative Doc (事务文书) 117 60 53 4 51.28 45.30 3.42
Personal Doc (个人文书) 73 45 27 1 61.64 36.99 1.37
Official
Document Government Doc (行政公文) 34 19 14 1 55.88 41.18 2.94
(公文文本) Speech (发言稿) 3 1 2 0 33.33 66.67 0.00
Essay Writing (申论写作) 3 1 1 1 33.33 33.33 33.33
Subtotal (小计) 230 126 97 7 54.78 42.17 3.04
Research Paper (学术论文) 104 67 32 5 64.42 30.77 4.81
Academic Coursework (课程作业) 90 53 35 2 58.89 38.89 2.22
Writing Academic Support (学术辅助) 15 11 3 1 73.33 20.00 6.67
(学术文本) Science Outreach (专业科普) 7 6 1 0 85.71 14.29 0.00
Subtotal (小计) 216 137 71 8 63.43 32.87 3.70
Total (总计) 3170 1986 1081 103 62.65 34.10 3.25
57
Table 13 | Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Creative
Writing.
Instruction Following(指令遵循) Writing Quality (写作质量)
Subcategory (文体) # DS Gem Tie DS% Gem% Tie% DS Gem Tie DS% Gem% Tie%
Fiction (小说故事) 836 504 323 5 60.58 38.82 0.60 672 157 3 80.77 18.87 0.36
General Fiction (泛小说故事) 662 368 290 3 55.67 43.87 0.45 467 194 0 70.65 29.35 0.00
Fan Fiction (同人文) 410 253 150 3 62.32 36.95 0.74 338 67 1 83.25 16.50 0.25
General Fan Fic. (泛同人文) 202 111 90 1 54.95 44.55 0.50 161 40 1 79.70 19.80 0.50
Narrative (记叙文) 171 115 54 2 67.25 31.58 1.17 141 30 0 82.46 17.54 0.00
General Prose (泛散文) 124 83 40 1 66.94 32.26 0.81 88 36 0 70.97 29.03 0.00
Prose (散文) 112 74 38 0 66.07 33.93 0.00 92 20 0 82.14 17.86 0.00
Writing Style (文笔) 112 81 31 0 72.32 27.68 0.00 86 26 0 76.79 23.21 0.00
Classical Poetry (古诗文) 48 24 24 0 50.00 50.00 0.00 39 9 0 81.25 18.75 0.00
Modern Poetry (现代诗) 43 23 20 0 53.49 46.51 0.00 32 11 0 74.42 25.58 0.00
Lyrics (歌词) 30 8 22 0 26.67 73.33 0.00 16 14 0 53.33 46.67 0.00
Literary Appreciation (赏析) 27 20 7 0 74.07 25.93 0.00 18 9 0 66.67 33.33 0.00
General Argument. (泛议论文) 24 15 9 0 62.50 37.50 0.00 17 7 0 70.83 29.17 0.00
General Narrative (泛记叙文) 23 11 12 0 47.83 52.17 0.00 15 8 0 65.22 34.78 0.00
General Classical (泛古文诗歌) 9 5 4 0 55.56 44.44 0.00 5 4 0 55.56 44.44 0.00
Creative Writing (创意写作) 6 2 4 0 33.33 66.67 0.00 4 2 0 66.67 33.33 0.00
Argumentative (议论文) 5 5 0 0 100.00 0.00 0.00 5 0 0 100.00 0.00 0.00
General Mod. Poetry (泛现代诗) 2 1 1 0 50.00 50.00 0.00 2 0 0 100.00 0.00 0.00
Total (总计) 2837 1703 1119 15 60.03 39.44 0.53 2198 634 5 77.48 22.35 0.18
Table 14 | DeepSeek-V4-Pro vs. Claude-Opus-4.5 on Complex Instruction Following and Multi-
Turn Writing.
Internal Evaluation (内部综合评估)
Category # DS Opus Tie DS% Opus% Tie%
Complex Inst. Following (复杂指令跟随) 49 23 26 0 46.9% 53.1% 0.0%
Multi-Turn Writing (多轮写作) 147 67 76 4 45.6% 51.7% 2.7%
Total (总计) 196 90 102 4 45.9% 52.0% 2.0%
58
I'm going to extract the text content using the standard pdftotext output for the main body up through the evaluation details section, then switch to layout-mode extraction for the appendix tables so they preserve their formatting in preformatted code blocks.
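A sketch of the layout-mode fallback, using only the `pdftotext` flags already seen in this session (`-layout`, `-f`, `-l`, and `-` for stdout). The helper names `layout_command`, `fenced`, and `layout_block` are hypothetical, and the fence-length logic is an assumption added so that table text containing backticks cannot close the block early.

```python
import re
import subprocess


def layout_command(pdf_path: str, first_page: int, last_page: int) -> list[str]:
    """Build the pdftotext invocation for a page range, writing to stdout."""
    return ["pdftotext", "-layout", "-f", str(first_page),
            "-l", str(last_page), pdf_path, "-"]


def fenced(text: str) -> str:
    """Wrap text in a code fence longer than any backtick run it contains."""
    # pdftotext separates pages with form feeds; treat them as line breaks.
    body = text.replace("\f", "\n").strip("\n")
    longest = max((len(run) for run in re.findall(r"`+", body)), default=0)
    fence = "`" * max(3, longest + 1)
    return f"{fence}\n{body}\n{fence}"


def layout_block(pdf_path: str, first_page: int, last_page: int) -> str:
    """Run pdftotext in layout mode and return the output as a fenced block."""
    result = subprocess.run(layout_command(pdf_path, first_page, last_page),
                            capture_output=True, text=True, check=True)
    return fenced(result.stdout)
```

For the appendix tables this would be called as `layout_block(pdf_path, 56, 58)`, matching the page range inspected above. The block still contains captions and page numbers from those pages, so some trimming would be needed before it is dropped into the Markdown.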
cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -10
main()
~~~~^^
File "/Users/kiran/Downloads/pdf2kindle/build_deepseek_v4.py", line 354, in main
body_md, appendix_idx, eval_idx, paragraphs = build_body(raw_lines)
~~~~~~~~~~^^^^^^^^^^^
File "/Users/kiran/Downloads/pdf2kindle/build_deepseek_v4.py", line 263, in build_body
appendix_idx = find_appendix_start(paragraphs)
File "/Users/kiran/Downloads/pdf2kindle/build_deepseek_v4.py", line 221, in find_appendix_start
raise ValueError("could not locate appendix start")
ValueError: could not locate appendix start
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^Appendix$|Author List and Acknow" /tmp/dsv4_raw.txt
475:A Author List and Acknowledgment
3841:A. Author List and Acknowledgment
I'm checking where the "Appendix" header appears in the document—it should be on its own line before line 3841.
cd /Users/kiran/Downloads/pdf2kindle && sed -n '3835,3845p' /tmp/dsv4_raw.txtRepresentations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL http s://openreview.net/forum?id=YrycTjllL0. 53 Appendix A. Author List and Acknowledgment A.1. Author List Authors are listed alphabetically by their first name. Names marked with * denote individuals who have departed from our team. Research & Engineering: Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang*, Bingzheng Xu,
I'm noticing that the paragraph detection never separates "Appendix" from "A. Author List and Acknowledgment": there is no blank line between them in the raw text, so they land in the same paragraph and the appendix heading never appears at the start of a block. I need to change how I split paragraphs so that these heading lines end up on their own.
The cleanest approach is to iterate through individual lines rather than pre-split paragraphs, treating heading lines themselves as natural paragraph boundaries so I can properly separate the appendix sections.
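A sketch of that line-oriented pass, assuming blank lines still end ordinary paragraphs. `split_blocks` and the `HEADING` regex are hypothetical; the regex is deliberately loose and only meant to show the control flow.

```python
import re

# Illustrative heading test: "Appendix", "References", "2.3.1. Title",
# "A. Title", or "A.1. Title". The real script may use stricter rules.
HEADING = re.compile(
    r"^(?:Appendix|References|(?:\d+(?:\.\d+)*|[A-Z](?:\.\d+)?)\.\s+\S.*)$"
)


def split_blocks(lines: list[str]) -> list[tuple[str, str]]:
    """Group lines into ("heading", text) and ("para", text) blocks."""
    blocks: list[tuple[str, str]] = []
    current: list[str] = []

    def flush() -> None:
        if current:
            blocks.append(("para", " ".join(current)))
            current.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            flush()                      # blank line ends the paragraph
        elif HEADING.match(line):
            flush()                      # a heading also ends the paragraph
            blocks.append(("heading", line))
        else:
            current.append(line)
    flush()
    return blocks
```

With this shape, consecutive heading lines such as `Appendix`, `A. Author List and Acknowledgment`, and `A.1. Author List` each become their own block even though no blank line separates them.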
cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -10
Wrote deepseek-v4.md (1828 lines)
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^#" deepseek-v4.md | head -80
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
12:## 1. Introduction
39:## 2. Architecture
48:### 2.1 Designs Inherited from DeepSeek-V3
54:### 2.2 Manifold-Constrained Hyper-Connections
237:### 2.3 Hybrid Attention with CSA and HCA
243:#### 2.3.1 Compressed Sparse Attention
435:#### 2.3.2 Heavily Compressed Attention
504:#### 2.3.3 Other Details
513:#### 2.3.4 Efficiency Discussion
522:### 2.4 Muon Optimizer
534:## 3. General Infrastructures
537:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
621:### 3.2 Flexible and Efficient Kernel Development with TileLang
633:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
642:### 3.4 FP4 Quantization-Aware Training
651:### 3.5 Training Framework
657:#### 3.5.1 Efficient Implementation of Muon
663:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
669:#### 3.5.3 Contextual Parallelism for Long-Context Attention
675:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
684:### 3.6 Inference Framework
690:#### 3.6.1 KV Cache Structure and Management
792:#### 3.6.2 On-Disk KV Cache Storage
801:## 4. Pre-Training
804:### 4.1 Data Construction
813:### 4.2 Pre-Training Setups
816:#### 4.2.1 Model Setups
822:#### 4.2.2 Training Setups
831:#### 4.2.3 Mitigating Training Instability
840:### 4.3 Evaluations
843:#### 4.3.1 Evaluation Benchmarks
852:# Shots
894:#### 4.3.2 Evaluation Results
903:## 5. Post-Training
906:### 5.1 Post-Training Pipeline
912:#### 5.1.1 Specialist Training
954:## 2. <think> thinking
969:## Tools You have access to a set of tools to help answer the user’s question. You can invoke tools by writing a "`<|DSML|tool_calls>`" block like the following: `<|DSML|tool_calls>` `<|DSML|invoke name="$TOOL_NAME">` `<|DSML|parameter name="$PARAMETER_NAME" string="true|false">`$PARAMETER_VALUE `</|DSML|parameter>` ... `</|DSML|invoke>` `<|DSML|invoke name="$TOOL_NAME2">` ... `</|DSML|invoke>` `</|DSML|tool_calls>` String parameters should be specified as is and set ‘string="true"‘. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set ‘string="false"‘. If thinking_mode is enabled (triggered by `<think>`), you MUST output your complete reasoning inside `<think>`...`</think>` BEFORE any tool calls or final response. Otherwise, output directly after `</think>` with tool calls or final response. ### Available Tool Schemas {Tool Definition...} You MUST strictly follow the above definedtool name and parameter schemas to invoke tool calls.
984:#### 5.1.2 On-Policy Distillation
1062:### 5.2 RL and OPD Infrastructures
1068:#### 5.2.1 FP4 Quantization Integration
1074:#### 5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
1080:#### 5.2.3 Preemptible and Fault-Tolerant Rollout Service
1089:#### 5.2.4 Scaling RL Framework for Million-Token Context
1095:#### 5.2.5 Sandbox Infrastructure for Agentic AI
1104:### 5.3 Standard Benchmark Evaluation
1107:#### 5.3.1 Evaluation Setup
1119:#### 5.3.2 Evaluation Results
1347:### 5.4 Performance on Real-World Tasks
1353:#### 5.4.1 Chinese Writing
1359:#### 5.4.2 Search
1365:#### 5.4.3 White-Collar Task
1509:#### 5.4.4 Code Agent
1536:## 6. Conclusion, Limitations, and Future Directions
1548:## A. Agache, M. Brooker, A. Florescu, A. Iordache, A. Liguori, R. Neugebauer, P. Piwonka, and
1557:## B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet,
1563:## A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
1566:## A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage,
1572:## A. Cheng, A. Jacovi, A. Globerson, B. Golan, C. Kwong, C. Alberti, C. Tao, E. Ben-David, G. S.
1581:## A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao,
1587:## B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko.
1608:## B. Shahriari, A. Ram’e, J. Ferret, P. Liu, P. D. Tafti, A. Friesen, M. Casbon, S. Ramos, R. Kumar,
1614:## A. Hutchison, A. Abdagic, A. Carl, A. Shen, A. Brock, A. Coenen, A. Laforge, A. Paterson,
1617:## B. Bastian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar, C. Perry, C. A. Welty, C. A. ChoquetteChoo, D. Sinopalnikov, D. Weinberger, D. Vijaykumar, D. Rogozi’nska, D. Herbison, E. Bandy,
1626:## B. D. Rouhani, R. Zhao, A. More, M. Hall, A. Khodamoradi, S. Deng, D. Choudhary, M. Cornea,
1635:## A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30,
1641:## B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou,
1654:## A. Author List and Acknowledgment
1657:### A.1 Author List
1666:### A.2 Acknowledgment
1672:## B. Evaluation Details
I'm spotting a few parsing errors in the output. There's a bogus heading at line 852, a leftover instruction fragment at line 954 where the "2." prefix got misinterpreted as a chapter marker, and another false match at line 969 where the Table 4 content triggered the heading regex because it started with ## Tools. I need to refine the pattern matching to avoid these false positives.
The core issue with Table 4 is that the original PDF had literal hash marks in the tool-call schema, which my code didn't escape. I should wrap those tables in code blocks so they can't be parsed as headings.
For the references section, the problem is that author names like "A. Agache" and "B. Jacob" match the appendix heading pattern, because "any capital letter followed by a period" is too loose. Reference lines have commas in their author lists, which would distinguish them from real appendix headings, but the more reliable fix is to hardcode the known appendix section titles instead of relying on a regex that keeps matching reference lines.
I see what happened at line 852: the PDF table extraction pulled the column header "# Shots" into the text as a separate line. My heading regex doesn't match it. "# Shots" is literally what pdftotext extracted, and because the line starts with #, Markdown renders it as a heading. I need to escape leading # characters in paragraphs that aren't actual headings.
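Escaping can be done with a single substitution applied to every non-heading paragraph. `escape_leading_hash` is a hypothetical name for this sketch.

```python
import re


def escape_leading_hash(paragraph: str) -> str:
    """Backslash-escape a '#' that starts a line so it stays literal text."""
    return re.sub(r"^(\s*)#", r"\1\\#", paragraph, flags=re.MULTILINE)
```

Only the first `#` on a line needs the backslash: once the line no longer begins with an unescaped `#`, the remaining characters are ordinary text, so `## Tools` becomes `\## Tools` and is rendered literally.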
The real problem is that Tables 1 through 8 are getting mangled during extraction — the dense numeric data and column alignment are collapsing into fragmented paragraphs. I should handle these tables the same way I did with Tables 9-14 by wrapping them in code blocks, either by using pdftotext -layout mode or by identifying and isolating the messy table regions directly.
cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout -f 28 -l 29 /Users/kiran/Downloads/DeepSeek_V4.pdf - | head -60
Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-
Pro-Base. All models are evaluated in our internal framework and share the same evaluation
setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. The highest
score in each row is in bold font, and the second is underlined.
DeepSeek-V3.2 DeepSeek-V4-Flash DeepSeek-V4-Pro
Benchmark (Metric) # Shots
Base Base Base
Architecture - MoE MoE MoE
# Activated Params - 37B 13B 49B
# Total Params - 671B 284B 1.6T
AGIEval (EM) 0-shot 80.1 82.6 83.1
MMLU (EM) 5-shot 87.8 88.7 90.1
MMLU-Redux (EM) 5-shot 87.5 89.4 90.8
MMLU-Pro (EM) 5-shot 65.5 68.3 73.5
MMMLU (EM) 5-shot 87.9 88.8 90.3
C-Eval (EM) 5-shot 90.4 92.1 93.1
World Knowl.
CMMLU (EM) 5-shot 88.9 90.4 90.8
MultiLoKo (EM) 5-shot 38.7 42.2 51.1
Simple-QA verified (EM) 25-shot 28.3 30.1 55.2
SuperGPQA (EM) 5-shot 45.0 46.5 53.9
FACTS Parametric (EM) 25-shot 27.1 33.9 62.6
TriviaQA (EM) 5-shot 83.3 82.8 85.6
BBH (EM) 3-shot 87.6 86.9 87.5
DROP (F1) 1-shot 88.2 88.6 88.7
Lang. & Reas. HellaSwag (EM) 0-shot 86.4 85.7 88.0
WinoGrande (EM) 0-shot 78.9 79.5 81.5
CLUEWSC (EM) 5-shot 83.5 82.2 85.2
BigCodeBench (Pass@1) 3-shot 63.9 56.8 59.2
HumanEval (Pass@1) 0-shot 62.8 69.5 76.8
GSM8K (EM) 8-shot 91.1 90.8 92.6
Code & Math
MATH (EM) 4-shot 60.5 57.4 64.5
MGSM (EM) 8-shot 81.3 85.7 84.4
CMath (EM) 3-shot 92.6 93.6 90.9
Long Context LongBench-V2 (EM) 1-shot 40.2 44.7 51.5
4.3.2. Evaluation Results
In Table 1, we provide a detailed comparison of the base models for DeepSeek-V3.2, DeepSeek-
V4-Flash, and DeepSeek-V4-Pro, all evaluated under a unified internal framework with strictly
consistent settings.
Comparing DeepSeek-V4-Flash-Base with DeepSeek-V3.2-Base reveals a compelling ef-
ficiency story. Despite utilizing a substantially smaller number of both activated and total
parameters, DeepSeek-V4-Flash-Base outperforms DeepSeek-V3.2-Base across a wide array of
benchmarks. This advantage is especially evident in world knowledge tasks and challenging
long-context scenarios. These results underscore that architectural improvements, refined data
quality, and training optimizations in DeepSeek-V4-Flash-Base yield superior performance even
with a more compact parameter budget, effectively surpassing the larger DeepSeek-V3.2-Base
on the majority of evaluations.
Furthermore, DeepSeek-V4-Pro-Base demonstrates a further, decisive leap in capability,
establishing near-universal dominance over both DeepSeek-V3.2-Base and DeepSeek-V4-Flash-
Base. With improvements across almost all categories, DeepSeek-V4-Pro-Base reaches new
28
performance highs among DeepSeek base models on the most demanding benchmarks. On
I'm realizing that wrapping tables in layout-mode preformatted blocks makes them much more readable. My approach is to identify each table region in the source by finding the "Table N |" headers, then extract just those page ranges using pdftotext with layout mode to preserve the formatting. I'll combine this with the raw reflowed text I'm already generating to get the best of both.
I'm mapping out the table and figure locations across the document: Tables 1 through 8 are scattered through the main sections, with 9-14 in the appendix. For figures, I just need to keep the captions and drop the fragmented chart data that follows. My best approach is to switch into a skip mode whenever I hit a Figure or Table caption, then resume normal processing once I reach actual prose again. This avoids the post-processing complexity and handles the cleanup in a single pass.
I'm refining the paragraph processor to be stateful: when it encounters a figure or table caption, it enters a mode where it skips noise paragraphs until hitting real prose again. For tables specifically, I'll insert a layout-mode code block right after the caption, then drop the fragmented content that follows. To make this work, I need to precompute page numbers for each table by scanning the layout text, though I'm also considering a simpler approach: just drop the table fragments entirely and rely on captions, or preserve the most important tables via layout fallback in an appendix-like section at the end.
Actually, I'm realizing that putting tables at the end breaks reading flow—the reader sees "as shown in Table 6" but has to flip to find it. Better to insert the layout-extracted table inline right after the caption for referenced tables, while dropping figure noise entirely. So my plan is to pre-scan the PDF layout to find each table location and capture its full content, then in the main processor, when I encounter a table caption, I'll replace the following noise with the extracted table in a code block. Now I need to check what the actual table formatting looks like in the layout text.
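A minimal sketch of that pre-scan, assuming captions always begin with `Table N |` or `Figure N |` as they do in the layout output above. The names `CAPTION_RE` and `find_captions` are illustrative and are not taken from the build script.

```python
import re

# Assumed caption shape: optional leading whitespace (pdftotext may emit a
# form feed at a page start), then "Table 4 |" or "Figure 7 |".
CAPTION_RE = re.compile(r"\s*(Table|Figure) (\d+) \|")


def find_captions(layout_text: str) -> list[tuple[str, int, int]]:
    """Return (kind, number, line_index) for every caption line."""
    found = []
    for index, line in enumerate(layout_text.split("\n")):
        match = CAPTION_RE.match(line)  # anchored at the start of the line
        if match:
            found.append((match.group(1), int(match.group(2)), index))
    return found
```

In-text mentions such as "as shown in Table 6" are ignored because the match is anchored at the start of the line and requires the `|` separator.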
cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout -f 30 -l 32 /Users/kiran/Downloads/DeepSeek_V4.pdf - | head -120
Table 2 | Comparison of three reasoning modes
Reasoning Characteristics Typical Use Cases Response Format
Mode
Non-think Fast, intuitive re- Routine daily tasks, </think> summary
sponses based on emergency reactions,
habits or simple low-risk decisions.
rules.
Think High Conscious logical Complex problem- <think> thinking
analysis, slower but solving, planning, tokens </think>
more accurate. medium-risk deci- summary
sions.
Think Max Push reasoning to its Exploring the bound- 1. A special system
fullest extent. Slow ary of model reason- prompt at the begin-
but powerful. ing capability. ning.
2. <think> thinking
tokens </think>
summary
Table 3 | Instruction injected into the system prompt for the "Think Max" mode.
Injected Instruction
Reasoning Effort: Absolute maximum with no shortcuts permitted.
You MUST be very thorough in your thinking and comprehensively decompose the
problem to resolve the root cause, rigorously stress-testing your logic against all potential
paths, edge cases, and adversarial scenarios.
Explicitly write out your entire deliberation process, documenting every intermediate
step, considered alternative, and rejected hypothesis to ensure absolutely no assumption
is left unchecked.
model leverages its own logic to generalize across complex tasks.
Tool-Call Schema and Special Token. Consistent with our previous version, we utilize a
dedicated <think></think> tag to delineate the reasoning path. In DeepSeek-V4 series, we
introduce a new tool-call schema that employs a special "|DSML|" token and utilizes an XML-
based format for tool invocations, as demonstrated in Table 4. Our experiments demonstrate that
the XML format effectively mitigates escaping failures and reduces tool-call errors, providing a
more robust interface for model-tool interactions.
Interleaved Thinking. DeepSeek-V3.2 introduced a context management strategy that retains
reasoning traces across tool-result rounds but discards them upon the arrival of new user mes-
sages. While effective, this still caused unnecessary token waste in complex agentic workflows
— each new user turn would flush all accumulated reasoning content, forcing the model to
reconstruct its problem-solving state from scratch. Leveraging the expanded 1M-token context
30
Table 4 | Tool-call schema for DeepSeek-V4 series.
Tool Call Schema
## Tools
You have access to a set of tools to help answer the user’s question. You can
invoke tools by writing a "<|DSML|tool_calls>" block like the following:
<|DSML|tool_calls>
<|DSML|invoke name="$TOOL_NAME">
<|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE
</|DSML|parameter>
...
</|DSML|invoke>
<|DSML|invoke name="$TOOL_NAME2">
...
</|DSML|invoke>
</|DSML|tool_calls>
String parameters should be specified as is and set ‘string="true"‘. For all
other types (numbers, booleans, arrays, objects), pass the value in JSON
format and set ‘string="false"‘.
If thinking_mode is enabled (triggered by <think>), you MUST output your
complete reasoning inside <think>...</think> BEFORE any tool calls or
final response.
Otherwise, output directly after </think> with tool calls or final response.
### Available Tool Schemas
{Tool Definition...}
You MUST strictly follow the above definedtool name and parameter schemas to
invoke tool calls.
window of DeepSeek-V4 series, we further refine this mechanism to maximize the effectiveness
of interleaved thinking in agentic environments:
• Tool-Calling Scenarios. As illustrated in Figure 7(a), all reasoning content is fully pre-
served throughout the entire conversation. Unlike DeepSeek-V3.2, which discarded
thinking traces upon each new user turn, DeepSeek-V4 series retain the complete reason-
ing history across all rounds, including across user message boundaries. This allows the
model to maintain a coherent, cumulative chain of thought over long-horizon agent tasks.
• General Conversational Scenarios. As illustrated in Figure 7(b), the original strategy is
preserved: reasoning content from previous turns is discarded when a new user message
arrives, keeping the context concise for settings where persistent reasoning traces provide
limited benefit.
As with DeepSeek-V3.2, agent frameworks that simulate tool interactions via user messages (e.g.,
Terminus) may not trigger the tool-calling context path and thus may not benefit from enhanced
reasoning persistence. We continue to recommend non-think models for such architectures.
31
a) Thinking with tools
b) Thinking without tools
Figure 7 | Thinking management of DeepSeek-V4 series.
I'm reconsidering the approach: instead of trying to preserve layout mode across the whole body, I'll extract it for sections 1-6 as a single source, then identify tables as indented multi-line blocks and fence them separately. This should handle Tables 1-8 cleanly without the complications of mixed whitespace handling.
The hybrid strategy is cleaner: use raw extraction for prose (which reflows naturally), but pull tables specifically from layout mode where they stay aligned. I can detect tables by their captions like "Table N |..." and map them back to the layout extraction. Then when processing the raw paragraph stream, I'll swap in the fenced table content when I hit a table caption, skip any noise paragraphs after it, and do the same for figures without inserting anything.
For the references section, I need to switch modes once we hit "References"—stop treating heading-like patterns as paragraph breaks since author names will match those patterns. I'll detect the references region by checking if we've encountered a "References" paragraph, then disable the heading-split logic for everything after that point.
Designing the processor...
I need to handle the References section properly by detecting it as a heading, then restrict appendix matching to known titles like "A. Author List and Acknowledgment" and "B. Evaluation Details" so they're recognized as proper headings rather than paragraph content. I'll also drop the standalone "Appendix" line since the next heading already marks the appendix start, and escape any stray # characters that appear in table fragments.
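The rules in the last two paragraphs could be sketched roughly as follows. This is a hypothetical classifier, not the script's actual code: `classify`, `NUMBERED_HEADING_RE`, and the 80-character cutoff are assumptions, and only the two appendix titles come from the plan above.

```python
import re

# Assumed heading shape: "1. Introduction", "2.3.1 Compressed Sparse Attention".
NUMBERED_HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+[A-Z]")
APPENDIX_TITLES = {"A. Author List and Acknowledgment", "B. Evaluation Details"}


def classify(paragraphs):
    """Tag each paragraph as ('heading', text) or ('text', text)."""
    in_references = False
    out = []
    for p in paragraphs:
        if p == "References":
            in_references = True          # from here on, numbered lines are citations
            out.append(("heading", p))
        elif p == "Appendix":
            continue                      # the next whitelisted title marks the appendix
        elif p in APPENDIX_TITLES:
            out.append(("heading", p))
        elif not in_references and len(p) <= 80 and NUMBERED_HEADING_RE.match(p):
            out.append(("heading", p))
        else:
            # Escape a leading '#' so a table fragment is not read as a heading.
            out.append(("text", "\\" + p if p.startswith("#") else p))
    return out
```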
Let me significantly improve the build script with state-tracking and table handling.
cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -10 && rg -n "^#" deepseek-v4.md | head -100
Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
775:## Tools You have access to a set of tools to help answer the user’s question. You can invoke tools by writing a "`<|DSML|tool_calls>`" block like the following: `<|DSML|tool_calls>` `<|DSML|invoke name="$TOOL_NAME">` `<|DSML|parameter name="$PARAMETER_NAME" string="true|false">`$PARAMETER_VALUE `</|DSML|parameter>` ... `</|DSML|invoke>` `<|DSML|invoke name="$TOOL_NAME2">` ... `</|DSML|invoke>` `</|DSML|tool_calls>` String parameters should be specified as is and set ‘string="true"‘. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set ‘string="false"‘. If thinking_mode is enabled (triggered by `<think>`), you MUST output your complete reasoning inside `<think>`...`</think>` BEFORE any tool calls or final response. Otherwise, output directly after `</think>` with tool calls or final response. ### Available Tool Schemas {Tool Definition...} You MUST strictly follow the above definedtool name and parameter schemas to invoke tool calls.
1462:## References
1490:## A. Author List and Acknowledgment
1492:### A.1 Author List
1498:### A.2 Acknowledgment
1502:## B. Evaluation Details
I'm noticing that the section headings have disappeared from the output, though the ## Tools heading is still there. The problem seems to be that when I wrap Table 4 in a code block, the layout extraction is picking up lines starting with ## Tools from the schema content, but those should be protected inside the code block and not parsed as markdown. Let me trace through the markdown output to see what's happening around that section.
I'm noticing the ## Tools heading is appearing after the closing code fence when it shouldn't be there—it looks like my filtering logic isn't catching this stray paragraph, and it should either be swallowed entirely or wrapped into the table structure above it.
Looking at how the table extraction works in layout mode, I'm capturing Table 4 from its caption all the way until the next figure or table caption appears, but that's pulling in prose that comes after the table ends. The real issue is that I need a better way to detect where a table actually terminates—not just by looking for the next caption, but by spotting when a left-aligned paragraph starts at column 0, which signals the end of the indented table region. Some tables might end at page breaks instead, so I need to handle both cases.
Refining the detection logic: a table ends when I encounter a line at column 0 that isn't a caption keyword and looks like prose—something with enough words and sentence-ending punctuation. I'm updating the extraction function to scan for that first non-indented, prose-like line after the table content, which will properly separate Table 4 from the paragraph about the DeepSeek-V4 series that follows it.
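A rough sketch of that end-of-table test, assuming table rows are indented in layout mode and prose starts at column 0. The helper names and the eight-word threshold are illustrative guesses, not values from the script.

```python
import re

CAPTION_RE = re.compile(r"^\s*(Table|Figure) \d+ \|")


def looks_like_prose(line: str) -> bool:
    """True for an unindented line with several words and some punctuation."""
    if not line or line[0].isspace():
        return False
    return len(line.split()) >= 8 and any(ch in line for ch in ".,;:")


def table_end(lines: list[str], start: int) -> int:
    """Index of the first line after the table whose caption is lines[start]."""
    for i in range(start + 1, len(lines)):
        if CAPTION_RE.match(lines[i]) or looks_like_prose(lines[i]):
            return i
    return len(lines)  # table runs to the end of the extracted text
```

One caveat with this version: a caption that wraps onto further unindented lines would itself pass `looks_like_prose`, so the caption block needs separate handling.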
cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -5 && rg -n "^#" deepseek-v4.md | head -80
Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
619:## Tools You have access to a set of tools to help answer the user’s question. You can invoke tools by writing a "`<|DSML|tool_calls>`" block like the following: `<|DSML|tool_calls>` `<|DSML|invoke name="$TOOL_NAME">` `<|DSML|parameter name="$PARAMETER_NAME" string="true|false">`$PARAMETER_VALUE `</|DSML|parameter>` ... `</|DSML|invoke>` `<|DSML|invoke name="$TOOL_NAME2">` ... `</|DSML|invoke>` `</|DSML|tool_calls>` String parameters should be specified as is and set ‘string="true"‘. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set ‘string="false"‘. If thinking_mode is enabled (triggered by `<think>`), you MUST output your complete reasoning inside `<think>`...`</think>` BEFORE any tool calls or final response. Otherwise, output directly after `</think>` with tool calls or final response. ### Available Tool Schemas {Tool Definition...} You MUST strictly follow the above definedtool name and parameter schemas to invoke tool calls.
988:## References
1016:## A. Author List and Acknowledgment
1018:### A.1 Author List
1024:### A.2 Acknowledgment
1028:## B. Evaluation Details
I'm noticing the numbered section headings aren't appearing in the output, even basic ones like the introduction and architecture sections. Let me check what's happening around line 600 to see where they're getting lost.
I see the issue now—the paragraph splitting logic isn't accounting for section headers that appear on their own line. When a header like "1. Introduction" sits directly above body text without a blank line separator, they're being treated as a single paragraph instead of distinct sections. I need to adjust the splitting function to recognize numbered headers as natural break points.
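One way to make numbered headers act as break points is sketched below. It uses a regular expression rather than the script's `KNOWN_HEADINGS` lookup, so treat `HEADING_RE` and the 80-character limit as assumptions made for illustration.

```python
import re

# Assumed heading shape: section number, then a capitalised title with no period.
HEADING_RE = re.compile(r"^\d+(\.\d+)*\.?\s+[A-Z][^.]*$")


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and also isolate heading-like lines."""
    paragraphs: list[str] = []
    current: list[str] = []

    def flush() -> None:
        if current:
            paragraphs.append(" ".join(current))
            current.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            flush()                       # blank line ends the paragraph
        elif len(stripped) <= 80 and HEADING_RE.match(stripped):
            flush()                       # heading closes whatever came before
            paragraphs.append(stripped)   # and stands alone
        else:
            current.append(stripped)
    flush()
    return paragraphs
```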
cd /Users/kiran/Downloads/pdf2kindle && sed -n '492,500p' /tmp/dsv4_raw.txt
3 1. Introduction The emergence of reasoning models (DeepSeek-AI, 2025; OpenAI, 2024c) has established a new paradigm of test-time scaling, driving substantial performance gains for Large Language Models (LLMs). However, this scaling paradigm is fundamentally constrained by the quadratic computational complexity of the vanilla attention mechanism (Vaswani et al., 2017), which creates a prohibitive bottleneck for ultra-long contexts and reasoning processes. Concurrently, the emergence of long-horizon scenarios and tasks — from complex agentic workflows to
I'm noticing that "1. Introduction" isn't triggering a paragraph split even though it should be recognized as a heading. Looking back at my current logic, I'm checking if a line matches KNOWN_HEADINGS, but that condition seems to be failing for this case. Let me trace through what's happening in the split_paragraphs function to see why the heading detection isn't working.
cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -5 && rg -n "^#" deepseek-v4.md | head -80
Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
10:## 1. Introduction
34:## 2. Architecture
40:### 2.1 Designs Inherited from DeepSeek-V3
44:### 2.2 Manifold-Constrained Hyper-Connections
166:### 2.3 Hybrid Attention with CSA and HCA
170:#### 2.3.1 Compressed Sparse Attention
296:#### 2.3.2 Heavily Compressed Attention
342:#### 2.3.3 Other Details
348:#### 2.3.4 Efficiency Discussion
354:### 2.4 Muon Optimizer
362:## 3. General Infrastructures
364:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
420:### 3.2 Flexible and Efficient Kernel Development with TileLang
428:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
434:### 3.4 FP4 Quantization-Aware Training
440:### 3.5 Training Framework
444:#### 3.5.1 Efficient Implementation of Muon
448:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
452:#### 3.5.3 Contextual Parallelism for Long-Context Attention
456:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
462:### 3.6 Inference Framework
466:#### 3.6.1 KV Cache Structure and Management
538:#### 3.6.2 On-Disk KV Cache Storage
544:## 4. Pre-Training
546:### 4.1 Data Construction
552:### 4.2 Pre-Training Setups
554:#### 4.2.1 Model Setups
558:#### 4.2.2 Training Setups
564:#### 4.2.3 Mitigating Training Instability
570:### 4.3 Evaluations
572:#### 4.3.1 Evaluation Benchmarks
628:#### 4.3.2 Evaluation Results
634:## 5. Post-Training
636:### 5.1 Post-Training Pipeline
640:#### 5.1.1 Specialist Training
680:## 2. <think> thinking
741:## Tools You have access to a set of tools to help answer the user’s question. You can invoke tools by writing a "`<|DSML|tool_calls>`" block like the following: `<|DSML|tool_calls>` `<|DSML|invoke name="$TOOL_NAME">` `<|DSML|parameter name="$PARAMETER_NAME" string="true|false">`$PARAMETER_VALUE `</|DSML|parameter>` ... `</|DSML|invoke>` `<|DSML|invoke name="$TOOL_NAME2">` ... `</|DSML|invoke>` `</|DSML|tool_calls>` String parameters should be specified as is and set ‘string="true"‘. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set ‘string="false"‘. If thinking_mode is enabled (triggered by `<think>`), you MUST output your complete reasoning inside `<think>`...`</think>` BEFORE any tool calls or final response. Otherwise, output directly after `</think>` with tool calls or final response. ### Available Tool Schemas {Tool Definition...} You MUST strictly follow the above definedtool name and parameter schemas to invoke tool calls.
751:#### 5.1.2 On-Policy Distillation
818:### 5.2 RL and OPD Infrastructures
822:#### 5.2.1 FP4 Quantization Integration
826:#### 5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
830:#### 5.2.3 Preemptible and Fault-Tolerant Rollout Service
836:#### 5.2.4 Scaling RL Framework for Million-Token Context
840:#### 5.2.5 Sandbox Infrastructure for Agentic AI
846:### 5.3 Standard Benchmark Evaluation
848:#### 5.3.1 Evaluation Setup
856:#### 5.3.2 Evaluation Results
1020:### 5.4 Performance on Real-World Tasks
1024:#### 5.4.1 Chinese Writing
1028:#### 5.4.2 Search
1032:#### 5.4.3 White-Collar Task
1138:#### 5.4.4 Code Agent
1158:## 6. Conclusion, Limitations, and Future Directions
1164:## References
1192:## A. Author List and Acknowledgment
1194:### A.1 Author List
1200:### A.2 Acknowledgment
1204:## B. Evaluation Details
I see there are still a couple of false positives in the heading detection. The issue is that Table 2 contains a numbered example in its "Response Format" column that starts with "2. <think> thinking", and my heading detection is picking this up as a real heading because it matches the pattern of a numbered item. The problem is that the detection logic is running on the split paragraphs from the raw body text, which doesn't distinguish between actual markdown headings and content that just happens to look like one when it appears in a table.
I'm noticing that my filtering logic isn't catching all the duplicate table content—the raw paragraph text from Table 2 is still leaking through even though it should be filtered out as noise. There's also a malformed heading appearing from the table's raw data that I need to clean up.
The real issue is that I'm only swallowing noise paragraphs immediately after captions, but the raw extraction contains fragmented table content scattered throughout that I've already emitted in layout mode. I need a better boundary detection—instead of trying to identify where the raw table region ends semantically, I should skip forward by counting the paragraphs that the layout region already covered, or use the line positions to align them properly.
The cleanest approach is to have the table extraction also return the first prose line that appears after each table, then use that as a marker in the raw text to know where to stop swallowing fragments. I'll modify the table extraction to capture the opening 60 characters of the next prose line as a boundary prefix, then in the main processor, after emitting a table, skip all raw paragraphs until I encounter that boundary text.
Defining the data structure and implementing the boundary detection...
The tricky part is handling Unicode characters and ensuring the prefix match works even when raw paragraphs are reflowed differently than the layout version, but the boundary line should still start the same way.
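The data structure and a tolerant prefix match might look like the sketch below. Only the field name `next_prose_prefix` appears in this session; `TableBlock`, `layout_text`, and `at_boundary` are hypothetical, and NFKC folding plus whitespace collapsing is one assumed way to cope with the Unicode and reflow differences.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class TableBlock:
    number: int
    layout_text: str         # table body as extracted in layout mode
    next_prose_prefix: str   # short prefix of the first prose line after the table


def _normalize(text: str) -> str:
    """Fold compatibility characters (e.g. ligatures) and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def at_boundary(paragraph: str, table: TableBlock) -> bool:
    """True when a raw paragraph begins with the table's boundary prefix."""
    prefix = _normalize(table.next_prose_prefix)
    return bool(prefix) and _normalize(paragraph).startswith(prefix)
```

An empty prefix never matches, so a table with no recorded boundary has to be handled by some other rule.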
cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -5 && rg -n "^#" deepseek-v4.md | head -80
Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
10:## 1. Introduction
34:## 2. Architecture
40:### 2.1 Designs Inherited from DeepSeek-V3
44:### 2.2 Manifold-Constrained Hyper-Connections
166:### 2.3 Hybrid Attention with CSA and HCA
170:#### 2.3.1 Compressed Sparse Attention
296:#### 2.3.2 Heavily Compressed Attention
342:#### 2.3.3 Other Details
348:#### 2.3.4 Efficiency Discussion
354:### 2.4 Muon Optimizer
362:## 3. General Infrastructures
364:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
420:### 3.2 Flexible and Efficient Kernel Development with TileLang
428:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
434:### 3.4 FP4 Quantization-Aware Training
440:### 3.5 Training Framework
444:#### 3.5.1 Efficient Implementation of Muon
448:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
452:#### 3.5.3 Contextual Parallelism for Long-Context Attention
456:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
462:### 3.6 Inference Framework
466:#### 3.6.1 KV Cache Structure and Management
538:#### 3.6.2 On-Disk KV Cache Storage
544:## 4. Pre-Training
546:### 4.1 Data Construction
552:### 4.2 Pre-Training Setups
554:#### 4.2.1 Model Setups
558:#### 4.2.2 Training Setups
564:#### 4.2.3 Mitigating Training Instability
570:### 4.3 Evaluations
572:#### 4.3.1 Evaluation Benchmarks
582:#### 4.3.2 Evaluation Results
588:## 5. Post-Training
590:### 5.1 Post-Training Pipeline
594:#### 5.1.1 Specialist Training
620:## 2. <think> thinking
689:#### 5.1.2 On-Policy Distillation
716:### 5.2 RL and OPD Infrastructures
720:#### 5.2.1 FP4 Quantization Integration
724:#### 5.2.2 Efficient Teacher Scheduling for Full-Vocabulary OPD
728:#### 5.2.3 Preemptible and Fault-Tolerant Rollout Service
734:#### 5.2.4 Scaling RL Framework for Million-Token Context
738:#### 5.2.5 Sandbox Infrastructure for Agentic AI
744:### 5.3 Standard Benchmark Evaluation
746:#### 5.3.1 Evaluation Setup
754:#### 5.3.2 Evaluation Results
774:### 5.4 Performance on Real-World Tasks
778:#### 5.4.1 Chinese Writing
782:#### 5.4.2 Search
786:#### 5.4.3 White-Collar Task
892:#### 5.4.4 Code Agent
912:## 6. Conclusion, Limitations, and Future Directions
918:## References
946:## A. Author List and Acknowledgment
948:### A.1 Author List
954:### A.2 Acknowledgment
958:## B. Evaluation Details
There's still a bogus heading at line 620 that slipped through—the Table 2 prefix matching didn't catch it. I need to trace back where Table 2's next_prose_prefix came from to figure out why the filtering failed.
cd /Users/kiran/Downloads/pdf2kindle && python3 -c "
import sys
sys.path.insert(0, '.')
from build_deepseek_v4 import extract_tables_layout
from pathlib import Path
tables = extract_tables_layout(Path('/Users/kiran/Downloads/DeepSeek_V4.pdf'))
for n in sorted(tables):
    print(n, repr(tables[n].next_prose_prefix))
"
1 'Pro-Base. All models are evaluated in ou'
2 'Table 3 | Instruction injected into the '
3 'model leverages its own logic to general'
4 'window of DeepSeek-V4 series, we further'
5 '{𝜋𝐸1 , 𝜋𝐸2 , . . . , 𝜋𝐸 𝑁 }, the OPD obj'
6 '"xHigh", and "High" denote reasoning eff'
7 '"High", and "Max" denote reasoning effor'
8 'using DeepSeek-V4-Pro for agentic coding'
9 'Table 10 | Cost Comparison:Agentic Searc'
10 'DeepSeek-V4-Pro. Most of the tool calls '
11 'Category Subcategory '
12 'Category Subcategory '
13 'Subcategory (文体) # DS Ge'
14 ''
I'm noticing an issue with how Table 1's caption is being parsed: the layout extraction is treating wrapped caption lines as prose because they lack indentation, when actually the caption text extends further than expected. Let me check the layout output to see where Table 1 actually ends.
Looking at the other tables, Table 2 correctly ends when Table 3's caption appears, Table 3 and Table 4 have the right prefixes, and Tables 11-13 are picking up header rows from the appendix which I'm handling separately anyway. The real problem is with my parsing order — I'm checking for headings before applying the swallow mode logic, so a heading that appears before a table caption gets emitted even when it should be suppressed.
I need to restructure the loop so that when I'm in table-swallowing mode, I check if the current paragraph matches the prefix for the next table first. If it does, I exit swallow mode and process it normally. If it doesn't match, I skip it entirely. Only after handling the swallow check should I process headings and captions. This way, content trapped between tables won't accidentally get emitted as headings.
Restructuring the parsing loop...
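The reordered loop can be sketched as below, with the swallow check placed ahead of caption and heading handling. This is a simplified stand-in: `process`, the regexes, and the plain `prefixes` dict are assumptions, and emitting the fenced layout table after each caption is left out to keep the ordering visible.

```python
import re

CAPTION_RE = re.compile(r"^Table (\d+) \|")
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")


def process(paragraphs: list[str], prefixes: dict[int, str]) -> list[str]:
    """prefixes maps a table number to the start of the first paragraph after it."""
    out: list[str] = []
    boundary = None  # non-None while swallowing raw table fragments
    for p in paragraphs:
        # 1. Swallow check first: skip everything until the boundary paragraph.
        if boundary is not None:
            if not p.startswith(boundary):
                continue
            boundary = None  # boundary reached; handle this paragraph normally
        # 2. Captions: emit, then start swallowing the fragments that follow.
        caption = CAPTION_RE.match(p)
        if caption:
            out.append(f"*{p}*")
            boundary = prefixes.get(int(caption.group(1))) or None
            continue
        # 3. Headings: only reached for paragraphs that were not swallowed.
        heading = HEADING_RE.match(p)
        if heading and len(p) <= 80:
            depth = heading.group(1).count(".") + 2
            out.append("#" * depth + " " + p)
            continue
        out.append(p)
    return out
```

Because a caption can itself be the boundary for the previous table (Table 3 closes Table 2 here), the boundary paragraph falls through to step 2 instead of being consumed.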
cd /Users/kiran/Downloads/pdf2kindle && python3 build_deepseek_v4.py /Users/kiran/Downloads/DeepSeek_V4.pdf -o deepseek-v4.md --title "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" --author "DeepSeek-AI" 2>&1 | tail -3 && rg -n "^#" deepseek-v4.md | head -80
Wrote deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
10:## 1. Introduction
34:## 2. Architecture
40:### 2.1 Designs Inherited from DeepSeek-V3
44:### 2.2 Manifold-Constrained Hyper-Connections
166:### 2.3 Hybrid Attention with CSA and HCA
170:#### 2.3.1 Compressed Sparse Attention
296:#### 2.3.2 Heavily Compressed Attention
342:#### 2.3.3 Other Details
348:#### 2.3.4 Efficiency Discussion
354:### 2.4 Muon Optimizer
362:## 3. General Infrastructures
364:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
420:### 3.2 Flexible and Efficient Kernel Development with TileLang
428:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
434:### 3.4 FP4 Quantization-Aware Training
440:### 3.5 Training Framework
444:#### 3.5.1 Efficient Implementation of Muon
448:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
452:#### 3.5.3 Contextual Parallelism for Long-Context Attention
456:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
462:### 3.6 Inference Framework
466:#### 3.6.1 KV Cache Structure and Management
538:#### 3.6.2 On-Disk KV Cache Storage
544:## 4. Pre-Training
546:### 4.1 Data Construction
552:### 4.2 Pre-Training Setups
554:#### 4.2.1 Model Setups
558:#### 4.2.2 Training Setups
564:#### 4.2.3 Mitigating Training Instability
570:### 4.3 Evaluations
572:#### 4.3.1 Evaluation Benchmarks
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^#" deepseek-v4.md | wc -l && rg -c "^#" deepseek-v4.md
32
32
cd /Users/kiran/Downloads/pdf2kindle && rg -n "^#" deepseek-v4.md
7:# DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
10:## 1. Introduction
34:## 2. Architecture
40:### 2.1 Designs Inherited from DeepSeek-V3
44:### 2.2 Manifold-Constrained Hyper-Connections
166:### 2.3 Hybrid Attention with CSA and HCA
170:#### 2.3.1 Compressed Sparse Attention
296:#### 2.3.2 Heavily Compressed Attention
342:#### 2.3.3 Other Details
348:#### 2.3.4 Efficiency Discussion
354:### 2.4 Muon Optimizer
362:## 3. General Infrastructures
364:### 3.1 Fine-Grained Communication-Computation Overlap in Expert Parallelism
420:### 3.2 Flexible and Efficient Kernel Development with TileLang
428:### 3.3 High-Performance Batch-Invariant and Deterministic Kernel Libraries
434:### 3.4 FP4 Quantization-Aware Training
440:### 3.5 Training Framework
444:#### 3.5.1 Efficient Implementation of Muon
448:#### 3.5.2 Cost-Effective and Memory-Efficient Implementation of mHC
452:#### 3.5.3 Contextual Parallelism for Long-Context Attention
456:#### 3.5.4 Extended Automatic Differentiation for Flexible Activation Checkpointing
462:### 3.6 Inference Framework
466:#### 3.6.1 KV Cache Structure and Management
538:#### 3.6.2 On-Disk KV Cache Storage
544:## 4. Pre-Training
546:### 4.1 Data Construction
552:### 4.2 Pre-Training Setups
554:#### 4.2.1 Model Setups
558:#### 4.2.2 Training Setups
564:#### 4.2.3 Mitigating Training Instability
570:### 4.3 Evaluations
572:#### 4.3.1 Evaluation Benchmarks
I'm noticing a problem with the file structure — there are only 32 headings visible, but the rest of the document appears to be missing sections like 4.3.2, 5, and 5.1. It looks like the swallow_table mode for Table 1 might have consumed everything after it, since the table's prefix text doesn't appear in the raw file, causing the swallow function to continue all the way to the end and consume all subsequent headings.
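A quick diagnostic for this failure mode is to list the tables whose boundary prefix never starts any raw paragraph, since those are the ones that would swallow to the end of the document. The helper below is a hypothetical sketch and is not part of the build script.

```python
def unmatched_prefixes(raw_paragraphs: list[str], prefixes: dict[int, str]) -> list[int]:
    """Return table numbers whose non-empty prefix starts no raw paragraph."""
    missing = []
    for number, prefix in sorted(prefixes.items()):
        if prefix and not any(p.startswith(prefix) for p in raw_paragraphs):
            missing.append(number)
    return missing
```

Running this before the main pass would flag Table 1 here, because its recorded prefix is a wrapped caption line that sits in the middle of a raw paragraph rather than at the start of one.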
Let me look at what Table 1's actual caption says...
cd /Users/kiran/Downloads/pdf2kindle && rg -n -A2 "^Table 1 \|" /tmp/dsv4_raw.txt | head -10
(no output)
cd /Users/kiran/Downloads/pdf2kindle && rg -n "Table 1 \|" /tmp/dsv4_raw.txt | head -3
1958:Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4Pro-Base. All models are evaluated in our internal framework and share the same evaluation
cd /Users/kiran/Downloads/pdf2kindle && sed -n '1955,1980p' /tmp/dsv4_raw.txt
27 Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4Pro-Base. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. The highest score in each row is in bold font, and the second is underlined. Benchmark (Metric) # Shots Architecture # Activated Params # Total Params - DeepSeek-V3.2 DeepSeek-V4-Flash DeepSeek-V4-Pro Base Base Base MoE 37B 671B MoE 13B
I'm looking at the table caption which describes a comparison across three DeepSeek model variants, noting that scores within 0.3 points are treated as equivalent, with formatting conventions for the highest and second-highest values in each row.
The layout extraction is breaking the caption across multiple lines, and my prose detection is incorrectly terminating the table early because the wrapped caption lines meet the criteria for prose (unindented, multiple words). I need to distinguish between caption text that continues across lines and actual table data. The key insight is that captions end at a blank line, so I should skip past the entire caption block before checking whether we've reached the end of the table.
Refining the approach...
cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout /Users/kiran/Downloads/DeepSeek_V4.pdf - | rg -n "^Table 1 \|" -A3 | head -10
(no output)
cd /Users/kiran/Downloads/pdf2kindle && pdftotext -layout /Users/kiran/Downloads/DeepSeek_V4.pdf /tmp/dsv4_layout.txt && rg -n "Table 1 \|" /tmp/dsv4_layout.txt | head -3
1542:Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-
cd /Users/kiran/Downloads/pdf2kindle && sed -n '1540,1555p' /tmp/dsv4_layout.txt
27
Table 1 | Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-
Pro-Base. All models are evaluated in our internal framework and share the same evaluation
setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. The highest
score in each row is in bold font, and the second is underlined.
DeepSeek-V3.2 DeepSeek-V4-Flash DeepSeek-V4-Pro
Benchmark (Metric) # Shots
Base Base Base
Architecture - MoE MoE MoE
# Activated Params - 37B 13B 49B
# Total Params - 671B 284B 1.6T
AGIEval (EM) 0-shot 80.1 82.6 83.1
MMLU (EM) 5-shot 87.8 88.7 90.1
MMLU-Redux (EM) 5-shot 87.5 89.4 90.8
I'm refining the parsing logic for table detection. The key insight is that a table block starts with a multi-line caption at column 0, followed by a blank line, then indented table data. The block ends when we hit the first column-0 prose line after a blank line. So I need to walk through the text, track whether we've seen at least one blank line, and only then check for the end condition.
Implementing the table termination logic...
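A sketch of that termination rule, under the assumption stated above that the caption block is closed by a blank line. `looks_like_prose` and its eight-word threshold are illustrative choices, not the script's real values.

```python
def looks_like_prose(line: str) -> bool:
    """True for an unindented line with several words and some punctuation."""
    if not line or line[0].isspace():
        return False
    return len(line.split()) >= 8 and any(ch in line for ch in ".,;:")


def table_end(lines: list[str], start: int) -> int:
    """Index of the first prose line after the table captioned at lines[start]."""
    seen_blank = False
    for i in range(start + 1, len(lines)):
        line = lines[i]
        if not line.strip():
            seen_blank = True   # caption block is over once a blank line appears
            continue
        # Wrapped caption lines look like prose, so ignore them until a blank
        # line has been seen.
        if seen_blank and looks_like_prose(line):
            return i
    return len(lines)
```

With this ordering the wrapped caption line beginning "Pro-Base. All models..." is no longer mistaken for the paragraph that follows the table.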