I'd like to do an experiment with using postgresql as the backend for git, two different methods to try: postgres extension or as pure sql, context: Sure. Here's where we got to:
Starting point: You asked whether Postgres could be used as a git backend. I gave a standard answer about libgit2 pluggable backends and performance tradeoffs.
The reframe: You connected it to your blog post about package managers using git as a database. Instead of telling ecosystems to migrate away from git, what if you fix git's storage layer by putting a real database underneath? The interface stays git, the backend becomes queryable.
Key insight: "Pull requests to a database." You keep git's collaboration model (branching, diffing, review, audit trail) but gain everything a database offers (point queries, indexing, replication).
Materialized views: Once objects are in Postgres, you can skip the git protocol entirely for optimized reads. Historical package versions become trivial queries. Reproducible builds become a SQL query against a point-in-time view.
Homebrew as the example: Mirror homebrew-core into Postgres-backed git storage, serve both the git protocol and the JSON API from the same database. No dual-system maintenance, no sync issues.
Postgres ecosystem: LISTEN/NOTIFY for real-time updates, logical replication for mirrors, foreign data wrappers for cross-registry joins, row-level security for private packages. Registries keep rebuilding these features badly.
Forgejo simplification: If repos live in Postgres alongside everything else, deploying a forge becomes "give me a Postgres connection string." One database, one backup, no filesystem to manage.
Extension vs pure SQL: A CREATE EXTENSION git would be elegant but kills managed Postgres deployability. Better to start with pure SQL tables and PL/pgSQL functions that work everywhere, then optionally build the C extension for performance.
Prior art found: libgit2-backends (SQLite, Redis, MySQL backends exist), git_fdw (read-only SQL queries over git repos), gitbase (SQL interface from source{d}, now defunct), MergeStat (SQLite variant). Nobody's built the write path — where git push lands objects directly in the database.
Next step: Claude Code session to build both the pure SQL and extension versions, using homebrew-core as the test case. Blog post as the third in a trilogy after "package managers shouldn't use git as a database" and the refs post. https://github.com/Homebrew/homebrew-core would make a good test once it's working (don't start with such a big repo though)
Now let me compile all findings. I have enough data from reading the source code of all these projects.
---
## 1. libgit2-backends (SQLite, MySQL, Redis, Memcached)
**Source:** `/tmp/libgit2-backends/` -- [GitHub repo](https://github.com/libgit2/libgit2-backends)
These are pluggable backends for libgit2's Object Database (ODB) and, in the Redis case, also the Reference Database (refdb). They replace git's on-disk loose/packed object storage with a database.
**The interface** is libgit2's `git_odb_backend` C struct. You implement six functions: `read`, `read_header`, `read_prefix`, `write`, `exists`, `free`. The Redis backend additionally implements `git_refdb_backend` (exists, lookup, write, delete, rename, iterate).
**SQLite schema** -- single table `git2_odb`:
```sql
CREATE TABLE git2_odb (
oid CHARACTER(20) PRIMARY KEY NOT NULL, -- raw 20-byte SHA1
type INTEGER NOT NULL, -- git object type (blob/tree/commit/tag)
size INTEGER NOT NULL,
data BLOB
);
```
**MySQL schema** -- identical structure but with MySQL types:
```sql
CREATE TABLE git2_odb (
oid BINARY(20) NOT NULL DEFAULT '' PRIMARY KEY,
type TINYINT(1) UNSIGNED NOT NULL,
size BIGINT(20) UNSIGNED NOT NULL,
data LONGBLOB NOT NULL,
KEY type (type),
KEY size (size)
) ENGINE=InnoDB;
```
A notable MySQL detail: data is stored with `COMPRESS(?)` on write and `UNCOMPRESS(data)` on read, using MySQL's built-in zlib compression. The SQLite backend stores blobs raw.
**Redis key scheme** -- uses Redis hashes:
- ODB objects: `{prefix}:{repo_path}:odb:{hex_oid}` with fields `type`, `size`, `data`
- Refs: `{prefix}:{repo_path}:refdb:{ref_name}` with fields `type`, `target`
Redis refs store either an OID string (for direct refs) or a symbolic target name.
**Memcached key scheme** -- uses three separate keys per object:
- `{raw_oid_bytes}:type`
- `{raw_oid_bytes}:size`
- `{raw_oid_bytes}:data`
**What worked:** The single-table ODB design is the simplest possible mapping. It's a pure key-value store where the key is the SHA1 and the value is the typed blob. That's all git's object model needs at the storage layer.
**What didn't:** These backends are abandoned. The last meaningful commit was years ago, and they depend on old libgit2 API versions. The SQLite and MySQL backends don't implement `read_prefix` (looking up objects by abbreviated hash), which is needed for real git usage. The Redis backend doesn't implement reflog at all (all reflog methods return `GIT_ERROR`). None of them handle packfile semantics -- they're strictly object-level.
**What to steal:** The schema design is worth studying for its minimalism. A single table with `(oid, type, size, data)` is the correct lowest layer for a content-addressable object store in SQL. MySQL's use of `COMPRESS/UNCOMPRESS` is smart. The type column (integer enum for blob/tree/commit/tag) allows filtering by object type without deserializing. Redis's namespaced key scheme (`prefix:repo:odb:hash`) is a clean pattern for multi-repo support.
---
## 2. git_fdw (PostgreSQL Foreign Data Wrapper)
**Source:** `/tmp/git_fdw/git_fdw.c` -- [GitHub repo](https://github.com/franckverrot/git_fdw)
This is a Postgres FDW that exposes a git repo as a read-only foreign table. It uses libgit2 under the hood.
**Schema:** Each foreign table maps to one repository on one branch and has exactly these columns:
```sql
CREATE FOREIGN TABLE repository (
sha1 text,
message text,
name text, -- author name
email text, -- author email
commit_date timestamp with time zone,
insertions int,
deletions int,
files_changed int
) SERVER git_fdw_server
OPTIONS (path '/path/to/repo.git', branch 'refs/heads/master');
```
The FDW walks commits using `git_revwalk` in topological order. For each commit, it computes a diff against the parent to get insertions/deletions/files_changed. This means every query does a full walk of the commit history. There's no caching, no index support, and no pushdown of WHERE clauses.
**What worked:** The SQL interface is natural. `SELECT message, name FROM rails_repository LIMIT 10` reads exactly like you'd want it to. Using `IMPORT FOREIGN SCHEMA` (Postgres 9.5+) gives a nice auto-setup path.
**What didn't:** Performance is terrible by design. Computing diffs for every commit on every query is expensive. The table is commit-only -- no access to refs, trees, blobs, or tags. You can't query across repos without creating separate foreign tables and using UNION. There's no filter pushdown, so `WHERE sha1 = 'abc...'` still walks the entire history. The project requires libgit2 v0.27, which is ancient. It hasn't been updated since ~2020.
**What to steal:** The column design for commits is good: sha1, message, author name/email, date, plus diff stats (insertions, deletions, files_changed). The diff stats per commit are genuinely useful and something most other tools skip at the table level.
---
## 3. gitbase (source{d})
**Source:** `/tmp/gitbase/` -- [GitHub repo](https://github.com/src-d/gitbase)
Gitbase is a MySQL-wire-protocol server written in Go that presents git repos as SQL tables. It reads from real git repos on disk (no ETL step). It was the most fully-realized attempt at SQL-over-git.
**Tables (11 total):**
| Table | Columns |
|-------|---------|
| `repositories` | `repository_id` |
| `refs` | `repository_id`, `ref_name`, `commit_hash` |
| `commits` | `repository_id`, `commit_hash`, `commit_author_name`, `commit_author_email`, `commit_author_when`, `committer_name`, `committer_email`, `committer_when`, `commit_message`, `tree_hash`, `commit_parents` (array) |
| `blobs` | `repository_id`, `blob_hash`, `blob_size`, `blob_content` |
| `tree_entries` | `repository_id`, `tree_entry_name`, `blob_hash`, `tree_hash`, `tree_entry_mode` |
| `files` | `repository_id`, `file_path`, `blob_hash`, `tree_hash`, `tree_entry_mode`, `blob_content`, `blob_size` |
| `remotes` | `repository_id`, `remote_name`, `remote_push_url`, `remote_fetch_url`, `remote_push_refspec`, `remote_fetch_refspec` |
| `ref_commits` | `repository_id`, `commit_hash`, `ref_name`, `history_index` |
| `commit_trees` | `repository_id`, `commit_hash`, `tree_hash` |
| `commit_blobs` | `repository_id`, `commit_hash`, `blob_hash` |
| `commit_files` | `repository_id`, `commit_hash`, `file_path`, `blob_hash`, `tree_hash`, `blob_content`, `blob_size` |
**Key design decisions:**
- Every table has `repository_id` as the first column, supporting multi-repo queries.
- Hashes are stored as `VARCHAR(40)` (hex strings).
- The core git objects (commits, blobs, tree_entries, refs) each get their own table.
- Relationship tables (`ref_commits`, `commit_trees`, `commit_blobs`, `commit_files`) are materialized as virtual tables. These represent the graph walk: which commits belong to which refs, which trees/blobs belong to which commits.
- The `files` table is a convenience join: it combines tree_entries with blob content, giving you file paths and contents.
- `commit_parents` is an array column on the commits table.
- Blob content is lazily loaded and has configurable max size (`GITBASE_BLOBS_MAX_SIZE`, default 5 MiB) and binary filtering (`GITBASE_BLOBS_ALLOW_BINARY`).
- Table partitioning by repository for parallel scanning.
- Index support: tables implement `IndexKeyValues` for creating indexes, using packfile offsets as keys for efficient lookups.
- "Squash" optimization: gitbase could combine multiple table scans into a single git walk when it recognized certain join patterns.
**What worked:** The schema design is the best of the bunch. Separating object tables from relationship tables lets you query at whatever level of abstraction you need. `ref_commits` with `history_index` is particularly clever -- it gives you the ordinal position of each commit within a ref's history, so you can write `WHERE history_index = 0` to get branch tips. The squash optimizer was a good idea for making joins between git tables fast.
**What didn't:** source{d} the company went out of business around 2020, and the project died with it. It was hard to set up (Docker-only in practice). Reading directly from packfiles at query time was slow for large repos. The squash optimizer only recognized specific join patterns and was brittle. Using MySQL wire protocol meant no Postgres ecosystem integration. The `commit_files` and `commit_blobs` virtual tables could produce massive result sets for repos with long histories.
**What to steal:** The full table list. The relationship tables (ref_commits, commit_trees, commit_blobs). The `history_index` column on ref_commits. The `repository_id` on every table. The separation of `tree_entries` from `files`. The blob size limits and binary content filtering. This is the schema to start from.
---
## 4. MergeStat
Two projects: [mergestat-lite](https://github.com/mergestat/mergestat-lite) (CLI, SQLite virtual tables) and [mergestat](https://github.com/mergestat/mergestat) (server, Postgres).
**mergestat-lite** is a CLI that uses SQLite virtual tables to query git repos. The virtual tables:
- `commits`: hash, message, author_name, author_email, author_when, committer_name, committer_email, committer_when, parents (count), plus hidden columns `repository` and `ref`
- `refs`: name, type, remote, full_name, hash, target, plus hidden `repository` and `tag`
- `stats`: file_path, additions, deletions, old_file_mode, new_file_mode, plus hidden `repository`, `rev`, `to_rev`
- `files`: path, executable, contents, plus hidden `repository` and `rev`
- `blame`: (not fully examined but follows similar pattern)
The "hidden" columns are how you pass the repo path and revision into the table-valued function. Example: `SELECT * FROM commits('path/to/repo')` or `SELECT * FROM commits WHERE repository = '/path/to/repo'`.
**mergestat (server)** uses Postgres with an ETL approach. A sync worker clones repos, extracts data, and writes to Postgres tables. The Postgres schema (from the init migration):
```sql
-- Core tables
repos (id uuid PK, repo text, ref text, is_github boolean, ...)
git_commits (repo_id uuid, hash text, message text, author_name, author_email, author_when, committer_name, committer_email, committer_when, parents integer)
PK: (repo_id, hash)
git_refs (repo_id uuid, full_name text, hash text, name text, remote text, target text, type text, tag_commit_hash text)
PK: (repo_id, full_name)
git_commit_stats (repo_id uuid, commit_hash text, file_path text, additions int, deletions int)
git_files (repo_id uuid, path text, executable boolean, contents text)
PK: (repo_id, path)
-- Views
git_branches: WHERE type = 'branch'
git_tags: WHERE type = 'tag'
-- Also GitHub-specific tables
github_issues, github_pull_requests, github_repo_info, github_stargazers,
github_pull_request_reviews, github_pull_request_commits
```
**Key design decisions:**
- ETL model: sync data into Postgres rather than querying git on the fly. This gives you real indexes, real query planning, and real SQL.
- `git_commit_stats` has per-file diffs (file_path, additions, deletions) linked to commit_hash. This is one of the most useful tables for analysis.
- Views for branches vs tags (filtering `git_refs` by type).
- Composite primary keys: `(repo_id, hash)` for commits, `(repo_id, full_name)` for refs.
- No blob/tree storage at all in the Postgres schema -- only commits, refs, files, and stats.
- The `parents` column on `git_commits` is just an integer count, not actual parent hashes.
**What worked:** The ETL approach means queries are fast -- just Postgres doing what Postgres does best. The `git_commit_stats` table with per-file diffs is genuinely useful for code analysis. Having both a CLI (mergestat-lite for ad-hoc) and a server (mergestat for persistent) covers two distinct use cases. Mailmap support in mergestat-lite is a nice touch.
**What didn't:** MergeStat the company pivoted and the open source project went quiet. The ETL approach means data is always stale unless you re-sync. The sync infrastructure is complex (queue tables, job scheduling, status tracking). No tree or blob content in the Postgres schema limits what you can query. The `parents` integer on commits means you can't traverse the commit graph in SQL.
**What to steal:** The `git_commit_stats` table (per-file additions/deletions per commit). The `git_refs` schema with `type` column and branch/tag views. The ETL-into-Postgres approach works well when you want real query performance. The composite primary key pattern `(repo_id, hash)`.
---
## 5. Other Projects
### Gitana
**Source:** `/tmp/Gitana/` -- [GitHub repo](https://github.com/SOM-Research/Gitana)
An academic project (research paper: "Gitana: A SQL-Based Git Repository Inspector") that imports git data into MySQL. Python-based ETL.
**Schema (MySQL):**
- `project` (id, name)
- `repository` (id, project_id, name)
- `reference` (id, repo_id, name, type)
- `commit` (id, repo_id, sha VARCHAR(512), message, author_id FK, committer_id FK, authored_date, committed_date, size)
- `commit_parent` (repo_id, commit_id, commit_sha, parent_id, parent_sha)
- `commit_in_reference` (repo_id, commit_id, ref_id)
- `file` (id, repo_id, name, ext)
- `file_renamed` (repo_id, current_file_id, previous_file_id, file_modification_id)
- `file_modification` (id, commit_id, file_id, status, additions, deletions, changes, patch LONGBLOB)
- `line_detail` (file_modification_id, type, line_number, is_commented, is_partially_commented, is_empty, content)
- `user` (id, name, email) with `user_alias` for identity dedup
- `file_dependency` (repo_id, ref_id, source_file_id, target_file_id)
**What's interesting:** Gitana normalizes authors into a separate `user` table with an `user_alias` system for mapping different name/email pairs to the same person. It stores `file_modification` with a `status` column (added/modified/deleted/renamed) and the actual patch as a LONGBLOB. The `line_detail` table tracks per-line metadata including whether lines are commented. The `file_renamed` table explicitly tracks rename history. The `file_dependency` table tracks import/include relationships between files.
**What to steal:** The `user` + `user_alias` normalization. The `file_modification` with status and patch blob. The explicit `commit_parent` join table (rather than storing parent hashes as an array). The file rename tracking.
### GitQLite
[GitHub](https://github.com/augmentable-dev/gitqlite) -- now merged into mergestat-lite. Same approach: SQLite virtual tables, Go + go-git. This was the predecessor to mergestat-lite.
### Dolt
[GitHub](https://github.com/dolthub/dolt) -- not a tool for querying git repos, but a SQL database that works *like* git. It uses "prolly trees" (probabilistic B-trees) stored in a content-addressed block store. Every chunk of data is hashed, and chunks with the same hash are stored once. This gives structural sharing between versions -- two branches that share most of their data share most of their storage.
**What to steal from Dolt's architecture:** The prolly tree idea that content-addressed chunks can share storage across versions. This is the right way to think about storing multiple snapshots of a database without duplicating unchanged data. Also their commit graph model, where branches/tags/HEAD are just pointers into a DAG of immutable commit objects, each pointing to a root prolly tree.
---
## 6. Postgres Extensions for Hash/SHA Storage
### hashtypes (pandrewhk/hashtypes)
**Source:** `/tmp/hashtypes/` -- [GitHub repo](https://github.com/pandrewhk/hashtypes), [PGXN](https://pgxn.org/dist/hashtypes/)
Originally "shatypes" (2009), provides native Postgres types for SHA-1, SHA-224, SHA-256, SHA-384, SHA-512, MD5, and CRC32. Each type stores the hash in its native binary length (20 bytes for SHA-1, 32 bytes for SHA-256, etc.) with no varlena overhead.
Provides full operator support: `=`, `<>`, `<`, `>`, `<=`, `>=`, plus btree and hash operator classes. Casts to/from `text` and `bytea`. All functions marked `PARALLEL SAFE`.
For SHA-1, the `sha1` type takes 20 bytes of storage vs 40+ bytes for a text hex string. That's a 2x space saving, and comparisons are faster because they compare raw bytes rather than parsing hex.
### sha (PGXN)
[PGXN](https://pgxn.org/dist/sha/) -- similar to hashtypes. Implements sha1, sha224, sha256, sha384, sha512, md5hash datatypes with btree and hash index support.
**What to steal:** Using a native `sha1` type (20 bytes, binary) instead of `VARCHAR(40)` or `TEXT` for git hashes would give 2x space savings and faster comparisons. The btree operator class means these can be used in primary keys and indexes efficiently. If we build a Postgres extension, we could either depend on hashtypes or implement our own sha1/sha256 type with the same approach.
---
## Summary of Lessons
**Schema patterns that keep showing up:**
- Every table has a `repository_id` or `repo_id` for multi-repo support
- Commits always have: hash, message, author (name, email, when), committer (name, email, when)
- Hashes stored as hex text everywhere (but shouldn't be -- use binary)
- Relationship tables for graph traversal: commit_parents, ref_commits, commit_files
**What works:**
- ETL into a real database (MergeStat, Gitana) gives the best query performance
- Reading live from git repos (gitbase, git_fdw) avoids staleness but is slow
- The libgit2-backends single-table ODB design is the right lowest layer for object storage
- gitbase's table design is the most complete and should be the starting point
**What doesn't work:**
- Companies building products around these projects keep going out of business (source{d}, MergeStat)
- Live-query approaches can't compete with Postgres's query planner working on real tables with real indexes
- Storing hashes as VARCHAR(40) wastes space and slows comparisons
- Not having a `commit_parent` relationship table makes graph traversal impossible in SQL
**What to steal for gitgres:**
1. The `hashtypes` extension's `sha1` native type (20 bytes binary with btree/hash ops)
2. gitbase's table decomposition: repositories, refs, commits, blobs, tree_entries, files, plus relationship tables (ref_commits, commit_trees, commit_blobs)
3. gitbase's `history_index` on ref_commits
4. MergeStat's `git_commit_stats` table (per-file additions/deletions per commit)
5. Gitana's `user` + `user_alias` normalization and `commit_parent` join table
6. The libgit2-backends approach for the object store layer: `(oid, type, size, data)` with MySQL-style compression
7. MergeStat's `git_refs.type` column with views for branches/tags
8. Dolt's content-addressed structural sharing idea for efficient multi-version storage
agentId: ae6ddf9cd030aabe6 (for resuming to continue this agent's work if needed)
<usage>total_tokens: 143205
tool_uses: 85
duration_ms: 678863</usage>Excellent research. Now let me design the plan. I have a clear picture of:
Let me now plan the two approaches:
Pure SQL approach: Tables + PL/pgSQL functions that implement git operations. This works on any Postgres including managed.
Extension approach: C extension with native types and functions for performance.
I should build the pure SQL version first since it's more portable. I'm structuring this in layers: an objects table storing the content-addressable store with OID, type, size and data; a refs table tracking branch and tag pointers with symbolic references; and a reflog for append-only history.
Then I need materialized views to make the data queryable—parsed commits, trees, and relationships between them. The core functions would handle hashing objects, reading and writing to the store, updating refs, walking commit history, listing tree contents, and diffing trees, plus import/export utilities. After that I can build the extension version with optimizations.
For testing, I should start small—import a test repo into Postgres, verify the basic operations work, then scale up to something like homebrew-core. Before diving in though, I should clarify the scope and constraints with the user.
Good research. Let me now design the implementation plan.
Now I have all the research I need. Let me synthesize this into a detailed implementation plan.
---
# Gitgres Implementation Plan
## The Core Idea
Store git objects and refs in PostgreSQL tables. Bridge the git protocol so that `git push` writes objects into the database and `git clone`/`fetch` reads them back out. The pure SQL version uses pgcrypto for SHA1 hashing and stores everything as `bytea`. The C extension version adds a native `git_oid` type, faster pack parsing, and zlib decompression inside Postgres.
## Project Structure
```
gitgres/
README.md
LICENSE
Makefile # top-level: delegates to sub-makes
sql/
schema.sql # core tables and indexes
functions/
object_hash.sql # SHA1 computation for git objects
object_read_write.sql # insert/lookup objects
tree_parse.sql # parse binary tree entries
commit_parse.sql # parse commit text format
ref_manage.sql # ref CRUD with CAS
pack_parse.sql # packfile parsing (pure SQL, limited)
pack_build.sql # packfile generation for fetch
pkt_line.sql # pkt-line encoding/decoding helpers
import_export.sql # fast-import/fast-export SQL helpers
views/
queryable.sql # materialized views for querying
seed.sql # test data
bridge/
git-remote-gitgres # Ruby script: git remote helper
Gemfile
lib/
gitgres/
connection.rb # PG connection management
pkt_line.rb # pkt-line read/write
pack_reader.rb # packfile parser (receives push data)
pack_writer.rb # packfile builder (serves fetch data)
ref_advertisement.rb # ref discovery responses
receive_pack.rb # handle git push
upload_pack.rb # handle git clone/fetch
fast_transport.rb # fast-import/fast-export bridge
ext/
gitgres/
Makefile # PGXS makefile
gitgres.c # PG_MODULE_MAGIC, init
git_oid_type.c # custom git_oid type (20-byte binary)
sha1_hash.c # C-level SHA1 (OpenSSL)
zlib_inflate.c # raw zlib inflate for pack objects
pack_parse.c # C-level packfile parser
gitgres.control # extension metadata
sql/
gitgres--0.1.sql # CREATE TYPE, CREATE FUNCTION
test/
test_helper.rb
schema_test.rb # table creation, constraints
object_hash_test.rb # SHA1 computation matches git
object_store_test.rb # write and read objects
tree_parse_test.rb # parse tree binary format
commit_parse_test.rb # parse commit text format
ref_test.rb # ref operations, CAS
pack_parse_test.rb # packfile parsing
pack_build_test.rb # packfile generation
pkt_line_test.rb # pkt-line encoding
remote_helper_test.rb # end-to-end push/clone
integration_test.rb # git push && git clone roundtrip
```
## Phase 1: Schema and Object Store (Pure SQL)
### Tables
```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- One row per hosted repository
CREATE TABLE repositories (
id serial PRIMARY KEY,
name text NOT NULL UNIQUE,
created_at timestamptz NOT NULL DEFAULT now()
);
-- Git objects: blob, tree, commit, tag
-- Stores the raw, uncompressed content (not the header).
-- The OID is computed as SHA1("<type> <size>\0<content>").
CREATE TABLE objects (
repo_id integer NOT NULL REFERENCES repositories(id),
oid bytea NOT NULL, -- 20 bytes, SHA1
type smallint NOT NULL, -- 1=commit, 2=tree, 3=blob, 4=tag
size integer NOT NULL, -- byte length of content
content bytea NOT NULL, -- raw object content (uncompressed)
PRIMARY KEY (repo_id, oid)
);
-- B-tree index on oid alone for prefix lookups
CREATE INDEX idx_objects_oid ON objects (oid);
-- Git references
CREATE TABLE refs (
repo_id integer NOT NULL REFERENCES repositories(id),
name text NOT NULL, -- e.g. "refs/heads/main"
oid bytea, -- 20 bytes, target SHA1 (NULL for symbolic)
symbolic text, -- target ref name (NULL for direct)
PRIMARY KEY (repo_id, name),
CHECK ((oid IS NOT NULL) != (symbolic IS NOT NULL))
);
-- Reflog entries
CREATE TABLE reflog (
id bigserial PRIMARY KEY,
repo_id integer NOT NULL REFERENCES repositories(id),
ref_name text NOT NULL,
old_oid bytea, -- 20 bytes
new_oid bytea, -- 20 bytes
committer text NOT NULL, -- "Name <email>"
timestamp_s bigint NOT NULL,
tz_offset text NOT NULL, -- e.g. "+0000"
message text,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX idx_reflog_ref ON reflog (repo_id, ref_name, id);
```
The `objects` table is deliberately flat, matching the libgit2-backends/SQLite approach. One row per object per repo. The OID is stored as 20 raw bytes in `bytea`, not as 40-character hex text. This halves index size and makes comparisons faster.
The type column uses git's internal numbering (1=commit, 2=tree, 3=blob, 4=tag) so pack format type codes map directly.
### Object Hash Function
```sql
-- Compute the git object ID (SHA1 of "<type> <size>\0<content>")
CREATE OR REPLACE FUNCTION git_object_hash(
obj_type smallint,
content bytea
) RETURNS bytea AS $$
DECLARE
type_name text;
header bytea;
BEGIN
type_name := CASE obj_type
WHEN 1 THEN 'commit'
WHEN 2 THEN 'tree'
WHEN 3 THEN 'blob'
WHEN 4 THEN 'tag'
END;
-- Build header: "<type> <size>\0"
header := convert_to(type_name || ' ' || octet_length(content)::text, 'UTF8')
|| '\x00'::bytea;
RETURN digest(header || content, 'sha1');
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
```
### Object Write with Hash Verification
```sql
CREATE OR REPLACE FUNCTION git_object_write(
p_repo_id integer,
p_type smallint,
p_content bytea
) RETURNS bytea AS $$
DECLARE
computed_oid bytea;
BEGIN
computed_oid := git_object_hash(p_type, p_content);
INSERT INTO objects (repo_id, oid, type, size, content)
VALUES (p_repo_id, computed_oid, p_type, octet_length(p_content), p_content)
ON CONFLICT (repo_id, oid) DO NOTHING;
RETURN computed_oid;
END;
$$ LANGUAGE plpgsql;
```
### Object Read
```sql
CREATE OR REPLACE FUNCTION git_object_read(
p_repo_id integer,
p_oid bytea
) RETURNS TABLE(type smallint, size integer, content bytea) AS $$
SELECT type, size, content
FROM objects
WHERE repo_id = p_repo_id AND oid = p_oid;
$$ LANGUAGE sql STABLE;
-- Prefix lookup: find object by abbreviated OID
CREATE OR REPLACE FUNCTION git_object_read_prefix(
p_repo_id integer,
p_oid_prefix bytea,
p_prefix_len integer -- in hex chars (4-40)
) RETURNS TABLE(oid bytea, type smallint, size integer, content bytea) AS $$
DECLARE
byte_len integer := p_prefix_len / 2;
BEGIN
-- For even-length prefixes, exact byte prefix match
IF p_prefix_len % 2 = 0 THEN
RETURN QUERY
SELECT o.oid, o.type, o.size, o.content
FROM objects o
WHERE o.repo_id = p_repo_id
AND substring(o.oid FROM 1 FOR byte_len) = p_oid_prefix;
ELSE
-- Odd-length: match full bytes plus upper nibble of next byte
RETURN QUERY
SELECT o.oid, o.type, o.size, o.content
FROM objects o
WHERE o.repo_id = p_repo_id
AND substring(o.oid FROM 1 FOR byte_len) = substring(p_oid_prefix FROM 1 FOR byte_len)
AND (get_byte(o.oid, byte_len) >> 4) = (get_byte(p_oid_prefix, byte_len) >> 4);
END IF;
END;
$$ LANGUAGE plpgsql STABLE;
```
### Ref Management
```sql
-- Ref write with compare-and-swap
CREATE OR REPLACE FUNCTION git_ref_update(
p_repo_id integer,
p_name text,
p_new_oid bytea,
p_old_oid bytea DEFAULT NULL, -- NULL means "create only"
p_force boolean DEFAULT false
) RETURNS boolean AS $$
DECLARE
current_oid bytea;
BEGIN
-- Lock the row (or gap) to prevent races
SELECT oid INTO current_oid
FROM refs
WHERE repo_id = p_repo_id AND name = p_name
FOR UPDATE;
IF NOT FOUND THEN
-- Creating new ref
IF p_old_oid IS NOT NULL AND p_old_oid != '\x0000000000000000000000000000000000000000'::bytea THEN
RETURN false; -- Expected existing ref, but none found
END IF;
INSERT INTO refs (repo_id, name, oid) VALUES (p_repo_id, p_name, p_new_oid);
RETURN true;
END IF;
-- Updating existing ref
IF NOT p_force AND p_old_oid IS NOT NULL AND current_oid != p_old_oid THEN
RETURN false; -- CAS failure
END IF;
IF p_new_oid = '\x0000000000000000000000000000000000000000'::bytea THEN
DELETE FROM refs WHERE repo_id = p_repo_id AND name = p_name;
ELSE
UPDATE refs SET oid = p_new_oid, symbolic = NULL
WHERE repo_id = p_repo_id AND name = p_name;
END IF;
RETURN true;
END;
$$ LANGUAGE plpgsql;
```
## Phase 2: Parsing Functions
### Tree Parsing (Pure SQL)
Tree entries are binary: `<mode-ascii> <name>\0<20-byte-sha1>`. Parsing this in PL/pgSQL means walking byte-by-byte through the bytea. It is slow but works.
```sql
-- Parse a tree object into its entries
CREATE OR REPLACE FUNCTION git_tree_entries(
p_tree_content bytea
) RETURNS TABLE(mode text, name text, entry_oid bytea) AS $$
DECLARE
pos integer := 1;
total integer := octet_length(p_tree_content);
space_pos integer;
null_pos integer;
mode_bytes bytea;
name_bytes bytea;
BEGIN
WHILE pos <= total LOOP
-- Find the space separating mode from name
space_pos := position('\x20'::bytea IN substring(p_tree_content FROM pos));
IF space_pos = 0 THEN EXIT; END IF;
space_pos := pos + space_pos - 1;
mode_bytes := substring(p_tree_content FROM pos FOR space_pos - pos);
-- Find the null byte after name
null_pos := position('\x00'::bytea IN substring(p_tree_content FROM space_pos + 1));
IF null_pos = 0 THEN EXIT; END IF;
null_pos := space_pos + null_pos; -- absolute position of \0
name_bytes := substring(p_tree_content FROM space_pos + 1 FOR null_pos - space_pos - 1);
mode := convert_from(mode_bytes, 'UTF8');
name := convert_from(name_bytes, 'UTF8');
entry_oid := substring(p_tree_content FROM null_pos + 1 FOR 20);
pos := null_pos + 21;
RETURN NEXT;
END LOOP;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
```
### Commit Parsing
Commits are text-based (headers separated by newlines, blank line before message body), which makes them easier to parse in SQL.
```sql
CREATE OR REPLACE FUNCTION git_commit_parse(
p_content bytea
) RETURNS TABLE(
tree_oid bytea,
parent_oids bytea[],
author_name text,
author_email text,
author_timestamp bigint,
author_tz text,
committer_name text,
committer_email text,
committer_timestamp bigint,
committer_tz text,
message text
) AS $$
DECLARE
txt text;
lines text[];
line text;
header_end integer;
i integer;
parents bytea[] := '{}';
parts text[];
BEGIN
txt := convert_from(p_content, 'UTF8');
header_end := position(E'\n\n' IN txt);
-- Parse headers
lines := string_to_array(substring(txt FROM 1 FOR header_end - 1), E'\n');
FOR i IN 1..array_length(lines, 1) LOOP
line := lines[i];
IF line LIKE 'tree %' THEN
tree_oid := decode(substring(line FROM 6), 'hex');
ELSIF line LIKE 'parent %' THEN
parents := array_append(parents, decode(substring(line FROM 8), 'hex'));
ELSIF line LIKE 'author %' THEN
-- Parse: "author Name <email> timestamp tz"
parts := regexp_match(substring(line FROM 8),
'^(.+?) <(.+?)> (\d+) ([+-]\d{4})$');
IF parts IS NOT NULL THEN
author_name := parts[1];
author_email := parts[2];
author_timestamp := parts[3]::bigint;
author_tz := parts[4];
END IF;
ELSIF line LIKE 'committer %' THEN
parts := regexp_match(substring(line FROM 11),
'^(.+?) <(.+?)> (\d+) ([+-]\d{4})$');
IF parts IS NOT NULL THEN
committer_name := parts[1];
committer_email := parts[2];
committer_timestamp := parts[3]::bigint;
committer_tz := parts[4];
END IF;
END IF;
END LOOP;
parent_oids := parents;
message := substring(txt FROM header_end + 2);
RETURN NEXT;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
```
## Phase 3: Materialized Queryable Views
Once objects are in Postgres, the whole point is querying them with SQL.
```sql
-- Parsed commits view
CREATE MATERIALIZED VIEW commits_view AS
SELECT
o.repo_id,
o.oid AS commit_oid,
encode(o.oid, 'hex') AS commit_sha,
c.tree_oid,
c.parent_oids,
c.author_name,
c.author_email,
to_timestamp(c.author_timestamp) AS authored_at,
c.committer_name,
c.committer_email,
to_timestamp(c.committer_timestamp) AS committed_at,
c.message
FROM objects o,
LATERAL git_commit_parse(o.content) c
WHERE o.type = 1; -- commit
CREATE UNIQUE INDEX idx_commits_view ON commits_view (repo_id, commit_oid);
-- Parsed trees view (flattened)
CREATE MATERIALIZED VIEW tree_entries_view AS
SELECT
o.repo_id,
o.oid AS tree_oid,
e.mode,
e.name,
e.entry_oid
FROM objects o,
LATERAL git_tree_entries(o.content) e
WHERE o.type = 2; -- tree
CREATE INDEX idx_tree_entries ON tree_entries_view (repo_id, tree_oid);
CREATE INDEX idx_tree_entries_name ON tree_entries_view (repo_id, name);
-- Refs with resolved targets
CREATE VIEW refs_view AS
SELECT
r.repo_id,
rp.name AS repo_name,
r.name AS ref_name,
encode(COALESCE(r.oid, r2.oid), 'hex') AS target_sha,
r.symbolic IS NOT NULL AS is_symbolic,
r.symbolic
FROM refs r
JOIN repositories rp ON rp.id = r.repo_id
LEFT JOIN refs r2 ON r2.repo_id = r.repo_id AND r2.name = r.symbolic;
-- File listing at a given commit (recursive tree walk)
CREATE OR REPLACE FUNCTION git_ls_tree_r(
p_repo_id integer,
p_tree_oid bytea,
p_prefix text DEFAULT ''
) RETURNS TABLE(mode text, path text, entry_oid bytea, obj_type smallint) AS $$
DECLARE
entry RECORD;
sub_content bytea;
BEGIN
FOR entry IN
SELECT e.mode, e.name, e.entry_oid
FROM git_tree_entries(
(SELECT content FROM objects WHERE repo_id = p_repo_id AND oid = p_tree_oid)
) e
LOOP
IF entry.mode = '40000' THEN
-- Subtree: recurse
RETURN QUERY SELECT * FROM git_ls_tree_r(
p_repo_id, entry.entry_oid,
p_prefix || entry.name || '/'
);
ELSE
mode := entry.mode;
path := p_prefix || entry.name;
entry_oid := entry.entry_oid;
obj_type := 3; -- blob
RETURN NEXT;
END IF;
END LOOP;
END;
$$ LANGUAGE plpgsql STABLE;
```
These let you write queries like:
```sql
-- All commits by a given author in the last month
SELECT commit_sha, message, authored_at
FROM commits_view
WHERE repo_id = 1 AND author_email = 'dev@example.com'
AND authored_at > now() - interval '1 month'
ORDER BY authored_at DESC;
-- Files changed between two commits (via tree diff)
-- List all files at HEAD
SELECT path, encode(entry_oid, 'hex')
FROM git_ls_tree_r(1, (SELECT tree_oid FROM commits_view WHERE commit_sha = 'abc...'))
ORDER BY path;
```
## Phase 4: The Git Protocol Bridge
This is the novel and most involved part. The bridge is a Ruby script named `git-remote-gitgres` that implements the git remote helper protocol. It sits on the user's PATH. When a user runs `git push gitgres://mydb/myrepo`, git invokes `git-remote-gitgres gitgres://mydb/myrepo` and communicates over stdin/stdout.
The bridge uses the **`connect` capability**, which means git handles the pack protocol negotiation itself and the bridge just needs to pipe bytes between git's native upload-pack/receive-pack and the database. But since there is no actual git process on the other end, the bridge must *be* upload-pack and receive-pack.
That means the bridge needs to:
**For `git push` (receive-pack side):**
1. Advertise refs from the database (pkt-line format)
2. Receive the list of ref updates (old-oid, new-oid, refname)
3. Receive the packfile
4. Parse the packfile into individual objects
5. Insert each object into the `objects` table
6. Update refs (with CAS using old-oid)
7. Send report-status back
**For `git clone`/`fetch` (upload-pack side):**
1. Advertise refs from the database
2. Receive want/have lines
3. Determine which objects are needed
4. Build a packfile from those objects
5. Send the packfile back
### Approach: Use `fetch`/`push` capabilities instead of `connect`
The `connect` capability requires implementing the full bidirectional pack protocol. A simpler approach: use the `fetch` and `push` capabilities of the remote helper protocol, paired with `git fast-import` and `git fast-export`. This way git handles all the pack/object format work, and the bridge only needs to shuttle text-format streams.
But this has limitations -- `git fast-export` works from a local repo, and we want to serve from the database directly. The cleaner path is to use the remote helper `connect` capability but implement a minimal receive-pack and upload-pack inside the bridge.
Actually, the simplest working approach is even more direct: **implement the HTTP smart protocol** behind a small HTTP server (WEBrick or similar thin Ruby server). Then users add a remote with `git remote add pg http://localhost:9418/myrepo` and push/fetch works with zero custom protocol work -- git's built-in HTTP transport handles everything.
I recommend the HTTP approach for the initial implementation. Here's why:
- The smart HTTP protocol is well-documented and widely tested
- We only need two endpoints: `GET /info/refs` and `POST /git-upload-pack` or `POST /git-receive-pack`
- The packfile parsing/generation code is needed either way
- Debugging is easier (curl, browser, wireshark)
### HTTP Bridge Structure
```ruby
# bridge/lib/gitgres/server.rb
#
# Endpoints:
# GET /:repo/info/refs?service=git-upload-pack
# GET /:repo/info/refs?service=git-receive-pack
# POST /:repo/git-upload-pack
# POST /:repo/git-receive-pack
class Gitgres::Server
def handle_info_refs(repo_name, service)
# Query refs from database
# Format as pkt-line advertisement
# Return with correct content-type
end
def handle_receive_pack(repo_name, request_body)
# Parse command lines (old_oid new_oid refname)
# Parse packfile from request body
# Insert objects into database
# Update refs with CAS
# Return report-status
end
def handle_upload_pack(repo_name, request_body)
# Parse want/have lines
# Walk object graph to find needed objects
# Build packfile
# Return packfile (possibly with side-band-64k framing)
end
end
```
### Packfile Parsing (Ruby, for receive-pack)
The bridge must parse incoming packfiles. This is the hardest part of the write path.
```ruby
# bridge/lib/gitgres/pack_reader.rb
class Gitgres::PackReader
# Parse a packfile stream into individual objects
#
# Packfile format:
# 4 bytes: "PACK"
# 4 bytes: version (network order, must be 2)
# 4 bytes: object count (network order)
# N object entries (variable length)
# 20 bytes: SHA1 checksum
#
# Each object entry:
# variable-length type+size header
# for OFS_DELTA: variable-length negative offset
# for REF_DELTA: 20-byte base object SHA1
# zlib-compressed data
def parse(io)
# Returns array of {type:, content:, oid:} hashes
end
def read_object_header(io)
# Read type (3 bits) and size (variable-length)
end
def apply_delta(base_content, delta_data)
# Apply copy/insert delta instructions
end
end
```
### Packfile Generation (Ruby, for upload-pack)
```ruby
# bridge/lib/gitgres/pack_writer.rb
class Gitgres::PackWriter
# Generate a packfile from a set of object OIDs
#
# For the initial version, emit undeltified objects only.
# Delta compression is an optimization for later.
def build(objects)
# Returns binary packfile data
# Header: "PACK" + version(2) + count
# For each object: type+size header + zlib-compressed content
# Trailer: SHA1 of everything above
end
end
```
### Object Graph Walking (for fetch/clone)
When a client sends `want <oid>` and `have <oid>` lines, the server needs to determine which objects to send. For a full clone (no `have` lines), this means walking every reachable object from the wanted commits. For incremental fetch, it means walking from wanted commits back to the common ancestors, collecting objects along the way.
```ruby
# bridge/lib/gitgres/object_walk.rb
class Gitgres::ObjectWalk
# Given a set of "want" OIDs and "have" OIDs,
# determine the set of objects to include in the packfile.
#
# Algorithm:
# 1. Start from each wanted commit
# 2. Walk parents recursively until we reach a "have" commit (or root)
# 3. For each commit in the walk, include:
# - The commit object itself
# - Its tree (and all subtrees/blobs, recursively)
# - Skip trees/blobs already reachable from "have" commits
def needed_objects(repo_id, wants, haves)
# Uses recursive CTE in SQL for the commit walk,
# then collects trees and blobs
end
end
```
The commit walk can be done efficiently with a recursive CTE:
```sql
-- Find all commits reachable from :want_oids but not from :have_oids
WITH RECURSIVE commit_walk AS (
-- Base case: the wanted commits
SELECT oid, content
FROM objects
WHERE repo_id = :repo_id AND oid = ANY(:want_oids) AND type = 1
UNION
-- Recursive case: parents of commits in the walk
SELECT o.oid, o.content
FROM commit_walk cw,
LATERAL git_commit_parse(cw.content) cp,
LATERAL unnest(cp.parent_oids) AS parent_oid,
objects o
WHERE o.repo_id = :repo_id
AND o.oid = parent_oid
AND o.type = 1
AND o.oid != ALL(:have_oids)
)
SELECT oid FROM commit_walk;
```
## Phase 5: The C Extension
The C extension replaces the performance-sensitive parts. Here is what it provides over pure SQL.
### 1. Native `git_oid` Type
A fixed 20-byte binary type with hex input/output, btree and hash operator classes. This is essentially what the `hashtypes` extension provides, but bundled so there's no external dependency.
```c
// ext/gitgres/git_oid_type.c
PG_FUNCTION_INFO_V1(git_oid_in); // Parse hex string -> 20 bytes
PG_FUNCTION_INFO_V1(git_oid_out); // 20 bytes -> hex string
PG_FUNCTION_INFO_V1(git_oid_recv); // Binary receive
PG_FUNCTION_INFO_V1(git_oid_send); // Binary send
PG_FUNCTION_INFO_V1(git_oid_eq);
PG_FUNCTION_INFO_V1(git_oid_ne);
PG_FUNCTION_INFO_V1(git_oid_lt);
PG_FUNCTION_INFO_V1(git_oid_le);
PG_FUNCTION_INFO_V1(git_oid_gt);
PG_FUNCTION_INFO_V1(git_oid_ge);
PG_FUNCTION_INFO_V1(git_oid_cmp);
PG_FUNCTION_INFO_V1(git_oid_hash);
```
With this type, the schema changes from `oid bytea` to `oid git_oid`, and you can write `WHERE oid = 'a1b2c3...'` without explicit `decode()` calls.
### 2. C-level SHA1
```c
// ext/gitgres/sha1_hash.c
// git_object_hash_c(type smallint, content bytea) RETURNS git_oid
//
// Computes SHA1("<type> <size>\0<content>") in C using OpenSSL.
// ~10x faster than pgcrypto's digest() for small objects.
PG_FUNCTION_INFO_V1(git_object_hash_c);
```
### 3. C-level Tree Parsing
The PL/pgSQL tree parser walks bytea byte-by-byte with `get_byte()` and `substring()`, which is extremely slow for large trees. The C version uses pointer arithmetic.
```c
// ext/gitgres/pack_parse.c
// git_tree_entries_c(content bytea) RETURNS SETOF (mode text, name text, oid git_oid)
PG_FUNCTION_INFO_V1(git_tree_entries_c);
```
### 4. C-level Pack Parsing
For very large pushes, parsing the packfile inside Postgres (as a stored procedure that receives the raw bytes) avoids round-tripping every object through the bridge. This is an optimization for after the bridge works correctly.
```c
// git_unpack_objects(repo_id int, packdata bytea) RETURNS integer
// Parses a packfile and inserts all objects into the objects table.
// Returns the number of objects inserted.
// Handles OFS_DELTA and REF_DELTA by resolving bases from the same pack
// or from existing objects in the database.
PG_FUNCTION_INFO_V1(git_unpack_objects);
```
### 5. C-level Zlib Inflate
```c
// git_inflate(data bytea) RETURNS bytea
// Raw zlib inflate. Needed for pack object decompression if we
// move pack parsing into Postgres.
PG_FUNCTION_INFO_V1(git_inflate);
```
### Extension SQL File
```sql
-- ext/gitgres/sql/gitgres--0.1.sql
-- Type
CREATE TYPE git_oid;
CREATE FUNCTION git_oid_in(cstring) RETURNS git_oid AS 'gitgres' LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION git_oid_out(git_oid) RETURNS cstring AS 'gitgres' LANGUAGE C IMMUTABLE STRICT;
-- ... recv, send, operators, operator class ...
CREATE TYPE git_oid (
INPUT = git_oid_in,
OUTPUT = git_oid_out,
RECEIVE = git_oid_recv,
SEND = git_oid_send,
INTERNALLENGTH = 20,
PASSEDBYVALUE = false,
ALIGNMENT = char,
STORAGE = plain
);
-- Operator class for btree
CREATE OPERATOR CLASS git_oid_ops DEFAULT FOR TYPE git_oid
USING btree AS
OPERATOR 1 <,
OPERATOR 2 <=,
OPERATOR 3 =,
OPERATOR 4 >=,
OPERATOR 5 >,
FUNCTION 1 git_oid_cmp(git_oid, git_oid);
-- Hash operator class
CREATE OPERATOR CLASS git_oid_hash_ops DEFAULT FOR TYPE git_oid
USING hash AS
OPERATOR 1 =,
FUNCTION 1 git_oid_hash(git_oid);
-- Functions
CREATE FUNCTION git_object_hash_c(smallint, bytea) RETURNS git_oid
AS 'gitgres' LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION git_tree_entries_c(bytea)
RETURNS TABLE(mode text, name text, entry_oid git_oid)
AS 'gitgres' LANGUAGE C IMMUTABLE STRICT ROWS 100;
CREATE FUNCTION git_inflate(bytea) RETURNS bytea
AS 'gitgres' LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION git_unpack_objects(integer, bytea) RETURNS integer
AS 'gitgres' LANGUAGE C VOLATILE;
```
### Build System
```makefile
# ext/gitgres/Makefile (PGXS)
MODULE_big = gitgres
OBJS = gitgres.o git_oid_type.o sha1_hash.o zlib_inflate.o pack_parse.o
EXTENSION = gitgres
DATA = sql/gitgres--0.1.sql
SHLIB_LINK = -lssl -lcrypto -lz
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
```
## Phase 6: Import/Export Utility
Before the full protocol bridge works, we need a way to get repos in and out of the database. A Ruby script that shells out to git and uses the `pg` gem.
```ruby
# bridge/bin/gitgres-import
# Usage: gitgres-import <repo-path> <database-url> <repo-name>
#
# 1. Opens the git repo with Rugged (or shells out to git)
# 2. Walks all objects using `git rev-list --objects --all`
# 3. For each object, reads its type and content with `git cat-file`
# 4. Inserts into the objects table
# 5. Reads all refs with `git show-ref` and inserts into refs table
# bridge/bin/gitgres-export
# Usage: gitgres-export <database-url> <repo-name> <output-path>
#
# 1. Creates a bare git repo at output-path
# 2. Reads all objects from the database
# 3. Writes each with `git hash-object -w --stdin` (with type flag)
# 4. Reads all refs from the database
# 5. Writes them with `git update-ref`
```
The import path using `git cat-file --batch` is fast enough for initial testing. For homebrew-core scale (hundreds of thousands of objects), we'd batch the INSERT with `COPY`.
## Testing Strategy
Tests use Minitest (per the preference for test unit style over rspec). Each test file connects to a test database, creates the schema in a transaction, runs assertions, and rolls back.
```ruby
# test/test_helper.rb
require 'minitest/autorun'
require 'pg'
class GitgresTest < Minitest::Test
def setup
@conn = PG.connect(dbname: 'gitgres_test')
@conn.exec("BEGIN")
# Load schema
@conn.exec(File.read('sql/schema.sql'))
# Load functions
Dir['sql/functions/*.sql'].sort.each { |f| @conn.exec(File.read(f)) }
# Create test repo
@conn.exec("INSERT INTO repositories (name) VALUES ('test')")
@repo_id = @conn.exec("SELECT id FROM repositories WHERE name = 'test'")[0]['id'].to_i
end
def teardown
@conn.exec("ROLLBACK")
@conn.close
end
end
```
Key test cases:
1. **`object_hash_test.rb`**: Compute SHA1 for known blob/tree/commit content, compare with `git hash-object`.
2. **`object_store_test.rb`**: Write an object, read it back, verify content and type match.
3. **`tree_parse_test.rb`**: Parse a known tree binary blob, verify mode/name/oid of each entry.
4. **`commit_parse_test.rb`**: Parse a known commit, verify tree, parents, author, committer, message.
5. **`ref_test.rb`**: Create/update/delete refs, test CAS (compare-and-swap) semantics.
6. **`pack_parse_test.rb`**: Generate a packfile with `git pack-objects`, parse it in Ruby, verify objects match.
7. **`pack_build_test.rb`**: Build a packfile from known objects, verify `git index-pack` accepts it.
8. **`pkt_line_test.rb`**: Encode/decode pkt-lines, test flush packets, test the "# service=" header.
9. **`integration_test.rb`**: Start the HTTP server, `git push` a small repo, `git clone` from it, verify the clone matches the original. This is the acid test.
For the integration test, create a temp git repo with a few commits, push it to the gitgres server, clone from the server into a new directory, and `diff -r` the two working trees.
## Build Order
The work naturally sequences into milestones where each produces something testable.
**Milestone 1: Schema + Object Store (days 1-2)**
- `sql/schema.sql`
- `sql/functions/object_hash.sql`
- `sql/functions/object_read_write.sql`
- `test/schema_test.rb`, `test/object_hash_test.rb`, `test/object_store_test.rb`
- At this point you can verify that `git_object_hash(3, 'hello')` produces the same SHA1 as `echo -n hello | git hash-object --stdin`.
**Milestone 2: Parsing Functions (days 3-4)**
- `sql/functions/tree_parse.sql`
- `sql/functions/commit_parse.sql`
- `test/tree_parse_test.rb`, `test/commit_parse_test.rb`
- Test against real tree and commit objects extracted from a git repo with `git cat-file`.
**Milestone 3: Import Utility (day 5)**
- `bridge/bin/gitgres-import`
- Import a small repo (say 50 commits). Verify the object count matches `git rev-list --objects --all | wc -l`. Verify a few random objects match `git cat-file -p`.
**Milestone 4: Materialized Views (day 5)**
- `sql/views/queryable.sql`
- Run `REFRESH MATERIALIZED VIEW` after import. Write example queries against commits_view and tree_entries_view.
**Milestone 5: Ref Management + Pkt-line (day 6)**
- `sql/functions/ref_manage.sql`
- `bridge/lib/gitgres/pkt_line.rb`
- `test/ref_test.rb`, `test/pkt_line_test.rb`
**Milestone 6: Pack Reader (days 7-8)**
- `bridge/lib/gitgres/pack_reader.rb`
- `test/pack_parse_test.rb`
- Test by generating packfiles with `git pack-objects` and parsing them. Verify each extracted object matches `git cat-file`.
**Milestone 7: Receive-Pack (HTTP, git push) (days 9-10)**
- `bridge/lib/gitgres/server.rb` (receive-pack endpoint)
- `bridge/lib/gitgres/receive_pack.rb`
- `bridge/lib/gitgres/ref_advertisement.rb`
- Test: start server, `git push` a repo, verify objects landed in database.
**Milestone 8: Pack Writer + Upload-Pack (HTTP, git clone) (days 11-12)**
- `bridge/lib/gitgres/pack_writer.rb`
- `bridge/lib/gitgres/object_walk.rb`
- `bridge/lib/gitgres/upload_pack.rb`
- Test: start server, `git clone` from it, verify clone matches original.
**Milestone 9: Full Integration (day 13)**
- `test/integration_test.rb`
- Push a repo, clone it, verify. Push more commits, fetch them, verify.
- Try with a larger repo (a few hundred commits).
**Milestone 10: C Extension (days 14-18)**
- `ext/gitgres/git_oid_type.c`
- `ext/gitgres/sha1_hash.c`
- `ext/gitgres/zlib_inflate.c`
- `ext/gitgres/pack_parse.c`
- Tests: same test suite, but with C functions replacing PL/pgSQL ones.
- Benchmark: compare hash computation, tree parsing, pack parsing times.
**Milestone 11: Scale Testing (days 19-20)**
- Import homebrew-core (or similar large repo)
- Measure import time, database size, query performance
- Tune: add `COPY` batching, possibly connection pooling, test with larger packs
## Key Decisions and Trade-offs
**bytea vs text for OIDs in pure SQL version.** I chose `bytea(20)` over `char(40)` hex. The 20-byte binary form halves storage and index size, and binary comparison is faster. The downside is that every query needs `encode(oid, 'hex')` for human-readable output, but that is what the materialized views handle.
**Storing raw content vs storing header+content.** The `objects` table stores only the content, not the git header (`blob 14\0...`). The type and size are in separate columns. This makes queries easier and avoids redundant data. The OID is recomputed from type+content when needed (rare -- mostly just at write time for verification).
**HTTP bridge vs git remote helper.** The HTTP smart protocol is the better first choice. It is stateless, well-documented, and debuggable with standard HTTP tools. The remote helper protocol (stdin/stdout) is an alternative if you want `git push gitgres://...` URLs, but adds complexity for no functional benefit in an experimental project.
**No delta compression in the pack writer initially.** The first version of pack_writer emits undeltified objects. Git clients will accept this -- it is valid if inefficient. Delta compression is a significant optimization that can be added later by finding similar objects and computing binary diffs.
**Packfile parsing in Ruby vs in Postgres.** The bridge parses packfiles in Ruby and inserts objects one at a time (or in batches). Moving pack parsing into a C stored procedure (`git_unpack_objects`) is the C extension's primary optimization for the write path. For the pure SQL version, Ruby-side parsing is the only option since PL/pgSQL cannot do zlib decompression.
**pgcrypto dependency.** The pure SQL version depends on `CREATE EXTENSION pgcrypto` for `digest()`. This is available on all managed Postgres services (RDS, Cloud SQL, Supabase, Neon) so it meets the "works on managed Postgres" requirement.
### Critical Files for Implementation
- `/Users/andrew/code/gitgres/sql/schema.sql` - Core tables (repositories, objects, refs, reflog) that everything else builds on
- `/Users/andrew/code/gitgres/sql/functions/object_hash.sql` - SHA1 computation matching git's format; correctness here is the foundation for the whole project
- `/Users/andrew/code/gitgres/bridge/lib/gitgres/pack_reader.rb` - Packfile parser for the write path; the hardest single component (variable-length encoding, zlib decompression, delta resolution)
- `/Users/andrew/code/gitgres/bridge/lib/gitgres/server.rb` - HTTP smart protocol server; the glue between git and the database
- `/Users/andrew/code/gitgres/ext/gitgres/git_oid_type.c` - Native 20-byte OID type; the centerpiece of the C extension that changes the schema and all query patterns
agentId: ad1f9bd62dcb6948c (for resuming to continue this agent's work if needed)
<usage>total_tokens: 49407
tool_uses: 30
duration_ms: 448674</usage>Excellent, I now have a thorough plan. Let me write it to the plan file and then ask the user a few clarifying questions before finalizing.
Let me think about what questions I should ask:
Actually, let me just write the plan and present it. The user's context was pretty clear on the approach. Let me check if there are specific things I need to clarify.
The user was explicit about starting with pure SQL, so I'll document that decision and move forward with writing out the plan file.
Good research from both agents. Let me write the plan.
The user wants to build a system where PostgreSQL serves as a git backend storage. Two approaches: pure SQL and a Postgres extension. Let me understand the scope:
This is a significant project. Let me plan this out before diving in.
The user has a clear vision from the conversation context. Let me enter plan mode to design the approach.
User has answered your questions: "For the import utility and bridge, should we use the `pg` gem directly to talk to Postgres, or do you have a preferred database connection approach?"="what? why is ruby involved?" user notes: what? why is ruby involved?, "For the HTTP server in the bridge, WEBrick ships with Ruby and needs no extra gems. Alternatives add dependencies but are more capable. Which do you prefer?"="no" user notes: no. You can now continue with the user's answers in mind.
The user is surprised by Ruby being involved. Let me re-read the context. They said "Claude Code session to build both the pure SQL and extension versions." They want to build SQL functions and a C extension, not a Ruby bridge.
Looking back at the context: "Extension vs pure SQL: A CREATE EXTENSION git would be elegant but kills managed Postgres deployability. Better to start with pure SQL tables and PL/pgSQL functions that work everywhere, then optionally build the C extension for performance."
The user's CLAUDE.md says "Prefer ruby for scripting tasks" but the user clearly doesn't want Ruby here. They want:
The bridge is a separate concern that can come later. For now I should focus on getting git objects into Postgres and queryable—the import script can be a shell script using git CLI and psql, and I can add materialized views and the C extension after that.
Fair point. Your CLAUDE.md says "prefer ruby for scripting" but this is really a SQL + C project. Let me rethink.
The bridge between git and the database needs to speak the git pack protocol, which requires a program somewhere. But that doesn't need to be the first thing we build.
User has answered your questions: "What language should the git protocol bridge be in? This is the piece that translates between git push/fetch and the database."="C (with libgit2)". You can now continue with the user's answers in mind.
Good, the user wants the bridge in C using libgit2. This makes sense -- libgit2 has the pluggable backend interface, so we implement the ODB and refdb backends that store in Postgres, and libgit2 handles all the protocol work.
Let me revise the plan:
Actually, the libgit2 backend and the Postgres extension could share some code.
I'm organizing this into three directories: SQL schemas and functions, a Postgres extension with the git_oid type and C functions, and the libgit2 backend layer that implements the ODB and refdb interfaces using libpq to query the database. Once Postgres is registered as the backend, libgit2 handles all the git protocol details -- pack negotiation, delta resolution, ref handling -- and my code just needs to execute the right SQL queries.
That's the cleaner approach. libgit2 handles all the protocol work; we just implement the storage layer. Let me revise the plan.
Entered plan mode. You should now focus on exploring the codebase and designing an implementation approach. In plan mode, you should: 1. Thoroughly explore the codebase to understand existing patterns 2. Identify similar features and architectural approaches 3. Consider multiple approaches and their trade-offs 4. Use AskUserQuestion if you need to clarify the approach 5. Design a concrete implementation strategy 6. When ready, use ExitPlanMode to present your plan for approval Remember: DO NOT write or edit any files yet. This is a read-only exploration and planning phase.
The user doesn't want to proceed with this tool use. The tool use was rejected (eg. if it was a file edit, the new_string was NOT written to the file). STOP what you are doing and wait for the user to tell you how to proceed.
Let me explore the current state of the repo and research the git object model to design this properly.
ls -la /Users/andrew/code/gitgres/total 0 drwxr-xr-x 2 andrew staff 64 23 Feb 08:14 . drwxr-xr-x 207 andrew staff 6624 23 Feb 08:14 ..
Empty directory. Let me research the git internals and prior art to design the schema properly.
Here is the full technical summary.
---
## 1. Git Object Model: Binary Format
Every git object follows the same storage pattern: a text header, a null byte, then the content. The SHA-1 hash is computed over the entire header+content concatenation *before* compression. The compressed result is stored on disk.
**General format (pre-compression):**
```
<type> <decimal-size-in-ascii>\0<content-bytes>
```
The SHA-1 is `SHA1("<type> <size>\0<content>")`. The result is a 20-byte (40 hex character) hash. On disk, loose objects live at `.git/objects/<first-2-hex>/<remaining-38-hex>` and the file content is zlib-deflated.
### Blob
The simplest object. Content is the raw file bytes, nothing else.
```
blob <size>\0<raw file bytes>
```
No filename, no permissions -- those live in the parent tree object.
### Tree
A tree represents a directory listing. Its content is a sequence of entries in binary format, sorted by name. Each entry:
```
<mode-ascii> <name>\0<20-byte-raw-sha1>
```
- **mode**: ASCII octal string -- `100644` (regular file), `100755` (executable), `040000` (subdirectory), `120000` (symlink), `160000` (gitlink/submodule)
- **space**: single 0x20 byte
- **name**: filename or directory name as raw bytes (no path separators)
- **null**: 0x00 terminator
- **SHA-1**: 20 raw bytes (not hex-encoded) pointing to a blob or another tree
The entries are concatenated with no separator between them -- the 20-byte fixed-length SHA-1 after the null byte implicitly marks the boundary. The tree object header wraps this the same way: `tree <total-content-size>\0<entries>`.
### Commit
Commit content is plain text with headers separated from the message by a blank line:
```
tree <40-hex-sha1>\n
parent <40-hex-sha1>\n (zero or more)
author <name> <<email>> <unix-timestamp> <tz-offset>\n
committer <name> <<email>> <unix-timestamp> <tz-offset>\n
gpgsig -----BEGIN PGP SIGNATURE-----\n (optional, continuation lines prefixed with a space)
<sig-data>\n
-----END PGP SIGNATURE-----\n
mergetag object <40-hex>\n (optional, for merge commits of signed tags)
type commit\n
...
\n
<commit message>
```
The timestamp is a Unix epoch integer. The timezone is `+HHMM` or `-HHMM`. The `gpgsig` header uses continuation lines prefixed with a single space. `mergetag` embeds the full content of a signed tag object, also with space-prefixed continuation.
The SHA-1 is computed over the full `commit <size>\0<all-text-above>`.
### Tag (annotated)
```
object <40-hex-sha1>\n
type <type-of-pointed-object>\n
tag <tag-name>\n
tagger <name> <<email>> <unix-timestamp> <tz-offset>\n
\n
<tag message>
```
A lightweight tag is just a ref pointing directly at a commit -- no tag object is created.
---
## 2. Packfiles
When git has many loose objects, or needs to transfer objects over the network, it packs them into `.pack` files with accompanying `.idx` index files.
### Pack File Format
**Header (12 bytes):**
```
4 bytes: signature "PACK" (0x5041434b)
4 bytes: version number (2, network byte order)
4 bytes: number of objects (network byte order)
```
**Object entries (repeated N times):**
Each entry begins with a variable-length integer encoding the object type and uncompressed size:
- First byte: bit 7 = continuation flag, bits 6-4 = 3-bit type, bits 3-0 = 4 low bits of size
- Subsequent bytes (while MSB is set): bit 7 = continuation flag, bits 6-0 = next 7 bits of size (shifted left by 4, then 11, then 18, etc.)
**Object type values:**
```
1 = OBJ_COMMIT
2 = OBJ_TREE
3 = OBJ_BLOB
4 = OBJ_TAG
5 = reserved
6 = OBJ_OFS_DELTA
7 = OBJ_REF_DELTA
```
For types 1-4 (undeltified), the type+size header is followed by zlib-compressed object content (just the content, no `blob <size>\0` header -- the type and size are in the pack entry header).
**REF_DELTA (type 7):** After the type+size header, a 20-byte raw SHA-1 of the base object, then zlib-compressed delta instructions.
**OFS_DELTA (type 6):** After the type+size header, a variable-length negative offset encoding. The offset tells you how many bytes backward from the current object's position in the pack to find the base object. Encoding: each byte uses 7 bits of data with MSB as continuation; for bytes after the first, add `2^7 + 2^14 + ...` to account for the encoding overhead. Then zlib-compressed delta instructions follow.
### Delta Instruction Format
Delta data (after decompression) starts with two variable-length integers:
1. Size of the base object (for verification)
2. Size of the reconstructed target object
Then a stream of instructions:
**Copy from base (MSB = 1):**
```
+----------+---------+---------+---------+---------+-------+-------+-------+
| 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
+----------+---------+---------+---------+---------+-------+-------+-------+
```
Bits 0-3 of the instruction byte indicate which offset bytes are present (little-endian). Bits 4-6 indicate which size bytes are present. Missing bytes are treated as zero. A size of 0 means 0x10000 (65536). This instruction copies `size` bytes from position `offset` in the base object to the output.
**Insert literal data (MSB = 0):**
```
+----------+============+
| 0xxxxxxx | data |
+----------+============+
```
The lower 7 bits give the count of literal bytes that follow. Must be non-zero. Maximum 127 bytes per insert instruction.
**Byte 0x00 is reserved and invalid.**
### Pack Index v2 Format
The `.idx` file lets you find any object in the `.pack` by OID without scanning the whole pack.
```
4 bytes: magic 0xff744f63
4 bytes: version = 2
256 * 4 bytes: fanout table (entry N = count of objects with first byte <= N; entry 255 = total objects)
N * 20 bytes: sorted OID table
N * 4 bytes: CRC32 table (CRC of each object's packed representation)
N * 4 bytes: offset table (31-bit offsets; if MSB set, index into large offset table)
Variable: large offset table (8-byte entries, for packs > 2GB)
20 bytes: SHA-1 of the pack file
20 bytes: SHA-1 of the index file
```
**Trailer of the .pack file:** 20-byte SHA-1 checksum of all preceding pack data.
### Thin Packs
A thin pack contains OFS_DELTA or REF_DELTA entries whose base objects are *not* in the pack -- they exist only on the receiving end. This reduces transfer size during fetch/push. The receiver must "thicken" the pack by resolving the missing bases from its local object store before the pack can be stored.
---
## 3. Refs
### Storage on Disk
**Loose refs:** Plain text files under `.git/refs/`. Each file contains a single 40-character hex SHA-1 followed by a newline.
```
.git/refs/heads/main -> "abc123...\n"
.git/refs/tags/v1.0 -> "def456...\n"
.git/refs/remotes/origin/main -> "789abc...\n"
```
**HEAD:** Usually a symbolic ref stored as:
```
ref: refs/heads/main\n
```
In detached HEAD state, it contains a bare 40-hex SHA-1 instead.
**Packed refs:** For performance, `git pack-refs` merges loose refs into a single `.git/packed-refs` file:
```
# pack-refs with: peeled fully-peeled sorted
<sha1> <refname>
^<peeled-sha1>
<sha1> <refname>
...
```
The header line declares traits. `peeled` means tag entries may have a `^<sha1>` line following them showing the peeled (dereferenced) target. `fully-peeled` means all tags have been peeled, and tags that can't be peeled are also marked. `sorted` means entries are in sorted order for binary search.
**Lookup order:** Git checks loose refs first (the file under `.git/refs/`), then falls back to `.git/packed-refs`. When a loose ref is updated, any corresponding packed-refs entry becomes stale and the loose file takes precedence.
### Ref Types
- **Branches:** `refs/heads/<name>` -- points to a commit
- **Tags:** `refs/tags/<name>` -- points to a commit (lightweight) or tag object (annotated)
- **Remote tracking:** `refs/remotes/<remote>/<branch>` -- read-only bookmarks
- **Symbolic refs:** Only HEAD normally; content is `ref: <target-ref>` instead of a SHA-1
- **Special refs:** `FETCH_HEAD`, `MERGE_HEAD`, `ORIG_HEAD`, etc.
### Reflogs
Each ref can have a reflog at `.git/logs/<refname>`. Each line records:
```
<old-sha1> <new-sha1> <name> <<email>> <timestamp> <tz>\t<message>\n
```
---
## 4. Git Transfer Protocol
### Protocol v1 (pack protocol)
All communication uses **pkt-line** framing: a 4-hex-digit length prefix (including the 4 bytes themselves), followed by data. `0000` is a flush packet signaling end of message.
#### Fetch (upload-pack)
1. **Reference discovery:** Server immediately sends all refs, with capabilities on the first line after a NUL byte:
```
<sha1> HEAD\0<capability-list>\n
<sha1> refs/heads/main\n
<sha1> refs/tags/v1.0\n
<sha1> refs/tags/v1.0^{}\n (peeled tag)
0000
```
2. **Negotiation:** Client sends `want` lines for desired objects (with capabilities on the first `want`), then `have` lines in batches of up to 32, then `done`:
```
want <sha1> multi_ack_detailed side-band-64k ofs-delta thin-pack\n
want <sha1>\n
0000
have <sha1>\n
have <sha1>\n
0000
```
Server responds with `ACK <sha1> common` / `ACK <sha1> ready` / `NAK`. When the server sends `ready`, it has found enough common ground to build an efficient pack. Client sends `done` to finalize.
3. **Pack transfer:** Server sends the packfile multiplexed over side-band channels:
- Channel 1: pack data
- Channel 2: progress/stderr
- Channel 3: fatal error
#### Push (receive-pack)
1. **Reference discovery:** Same format as fetch but different capabilities (`report-status`, `delete-refs`, `atomic`, `push-options`, `ofs-delta`).
2. **Update commands:** Client sends old-sha1/new-sha1/refname triples:
```
<old-sha1> <new-sha1> refs/heads/main\0report-status\n
<old-sha1> <new-sha1> refs/heads/feature\n
0000
```
A zero SHA-1 as old-id means create; zero as new-id means delete.
3. **Packfile:** Client sends the pack data (unless only deleting refs).
4. **Status report:** If `report-status` was negotiated:
```
unpack ok\n
ok refs/heads/main\n
ng refs/heads/feature non-fast-forward\n
0000
```
5. **Server-side hooks:** `pre-receive` (all-or-nothing gate), `update` (per-ref), `post-receive` (after success). Objects are quarantined until pre-receive passes.
#### Key Capabilities
| Capability | Effect |
|---|---|
| `multi_ack_detailed` | Server sends `ACK common`/`ACK ready` during negotiation |
| `side-band-64k` | Multiplexed transfer, up to 65519 bytes per packet |
| `ofs-delta` | Pack may use offset deltas (type 6) |
| `thin-pack` | Pack may reference base objects not in the pack |
| `include-tag` | Send annotated tag objects when their targets are sent |
| `report-status` | Server reports per-ref success/failure on push |
| `atomic` | All ref updates succeed or all fail |
| `delete-refs` | Server accepts ref deletions |
| `no-done` | (HTTP) Skip final round-trip when server sends `ready` |
| `shallow` | Support shallow clone operations |
| `filter` | Support partial clone filtering |
### Protocol v2
Protocol v2 restructures the conversation into explicit commands. The client requests v2 by sending `version=2` in the transport-specific way (extra parameters for git://, `GIT_PROTOCOL` env for SSH, `Git-Protocol` header for HTTP).
Server responds with a capability advertisement:
```
version 2
agent=git/2.40.0
ls-refs=unborn
fetch=shallow wait-for-done
server-option
object-format=sha1
```
**ls-refs command:** Explicitly requests ref listing (replaces v1's automatic advertisement). Supports arguments like `symrefs`, `peel`, and `ref-prefix <prefix>` for filtering.
**fetch command:** Handles negotiation and pack transfer. Uses sections in the response: `acknowledgments`, `shallow-info`, `wanted-refs`, `packfile-uris`, `packfile`. The `packfile` section uses side-band-64k multiplexing. Supports `want-ref` to request refs by name rather than SHA-1.
Key architectural difference: v2 is stateless. Each command completes in a single request/response. This makes it HTTP-friendly and load-balancer-friendly.
v2 adds `delim-pkt` (0001) and `response-end-pkt` (0002) in addition to `flush-pkt` (0000).
Push still uses `receive-pack` from v1 -- protocol v2 doesn't yet replace the push path.
---
## 5. libgit2 Pluggable Backends
libgit2 separates storage from logic through two backend interfaces: the Object Database (ODB) backend and the Reference Database (refdb) backend. You implement a C struct full of function pointers, register it, and libgit2 calls your code instead of touching the filesystem.
### ODB Backend (`git_odb_backend`)
From `/tmp/libgit2-inspect/include/git2/sys/odb_backend.h`:
```c
struct git_odb_backend {
unsigned int version;
git_odb *odb;
// Read full object. Allocate buffer with git_odb_backend_data_alloc().
// Out params: data pointer, size, object type.
int (*read)(void **data_out, size_t *len_out, git_object_t *type_out,
git_odb_backend *backend, const git_oid *oid);
// Find object by abbreviated OID.
int (*read_prefix)(git_oid *full_oid_out, void **data_out, size_t *len_out,
git_object_t *type_out, git_odb_backend *backend,
const git_oid *short_oid, size_t len);
// Read just type and size, no data.
int (*read_header)(size_t *len_out, git_object_t *type_out,
git_odb_backend *backend, const git_oid *oid);
// Write an object. OID is pre-computed by libgit2.
int (*write)(git_odb_backend *backend, const git_oid *oid,
const void *data, size_t len, git_object_t type);
// Create a write stream for large objects.
int (*writestream)(git_odb_stream **stream_out, git_odb_backend *backend,
git_object_size_t size, git_object_t type);
// Create a read stream.
int (*readstream)(git_odb_stream **stream_out, size_t *len_out,
git_object_t *type_out, git_odb_backend *backend,
const git_oid *oid);
// Check if object exists. Return 1 if yes, 0 if no.
int (*exists)(git_odb_backend *backend, const git_oid *oid);
// Check existence by prefix.
int (*exists_prefix)(git_oid *full_oid_out, git_odb_backend *backend,
const git_oid *short_oid, size_t len);
// Called on failed lookups to refresh caches.
int (*refresh)(git_odb_backend *backend);
// Iterate all objects.
int (*foreach)(git_odb_backend *backend, git_odb_foreach_cb cb, void *payload);
// Accept a packfile for bulk writing (used during fetch/push).
int (*writepack)(git_odb_writepack **writepack_out, git_odb_backend *backend,
git_odb *odb, git_indexer_progress_cb progress_cb,
void *progress_payload);
// Write a multi-pack-index.
int (*writemidx)(git_odb_backend *backend);
// Update last-used time on an existing object (called when write finds dup).
int (*freshen)(git_odb_backend *backend, const git_oid *oid);
// Free all resources. Required.
void (*free)(git_odb_backend *backend);
};
```
The `writepack` interface returns a `git_odb_writepack` with three callbacks:
```c
struct git_odb_writepack {
git_odb_backend *backend;
int (*append)(git_odb_writepack *, const void *data, size_t size, git_indexer_progress *stats);
int (*commit)(git_odb_writepack *, git_indexer_progress *stats);
void (*free)(git_odb_writepack *);
};
```
This is how packfile data arrives during push: libgit2 calls `append` repeatedly with chunks of pack data, then `commit` to finalize. Your backend needs to parse the pack stream, extract objects, and store them.
**What the SQLite reference backend does:** Single table, stores the raw object content (not the git header), with OID, type integer, size integer, and data blob as columns:
```sql
CREATE TABLE git2_odb (
oid CHARACTER(20) PRIMARY KEY NOT NULL,
type INTEGER NOT NULL,
size INTEGER NOT NULL,
data BLOB
);
```
The OID is stored as 20 raw bytes, not hex. Objects are stored in their decompressed form (no zlib). The git header (`blob 123\0`) is not stored -- the type and size columns replace it.
### Refdb Backend (`git_refdb_backend`)
From `/tmp/libgit2-inspect/include/git2/sys/refdb_backend.h`:
```c
struct git_refdb_backend {
unsigned int version;
// Initialize refdb for new repo (set up HEAD, etc.)
int (*init)(git_refdb_backend *backend, const char *head_target,
mode_t mode, uint32_t flags);
// Check if ref exists. Set *exists to 0 or 1. Required.
int (*exists)(int *exists, git_refdb_backend *backend, const char *ref_name);
// Look up a ref by name. Returns GIT_ENOTFOUND if missing. Required.
int (*lookup)(git_reference **out, git_refdb_backend *backend,
const char *ref_name);
// Create an iterator, optionally filtered by glob pattern. Required.
int (*iterator)(git_reference_iterator **iter,
struct git_refdb_backend *backend, const char *glob);
// Write a ref (direct or symbolic). Supports CAS via old/old_target. Required.
int (*write)(git_refdb_backend *backend, const git_reference *ref,
int force, const git_signature *who, const char *message,
const git_oid *old, const char *old_target);
// Rename a ref. Required.
int (*rename)(git_reference **out, git_refdb_backend *backend,
const char *old_name, const char *new_name, int force,
const git_signature *who, const char *message);
// Delete a ref. Supports CAS via old_id/old_target. Required.
int (*del)(git_refdb_backend *backend, const char *ref_name,
const git_oid *old_id, const char *old_target);
// Pack/optimize refs. Optional.
int (*compress)(git_refdb_backend *backend);
// Does this ref have a reflog? Return 1/0. Required.
int (*has_log)(git_refdb_backend *backend, const char *refname);
// Ensure reflog exists and will be appended on writes. Required.
int (*ensure_log)(git_refdb_backend *backend, const char *refname);
// Free all resources. Required.
void (*free)(git_refdb_backend *backend);
// Read a reflog. Required.
int (*reflog_read)(git_reflog **out, git_refdb_backend *backend,
const char *name);
// Write a reflog. Required.
int (*reflog_write)(git_refdb_backend *backend, git_reflog *reflog);
// Rename a reflog. Required.
int (*reflog_rename)(git_refdb_backend *backend,
const char *old_name, const char *new_name);
// Delete a reflog. Required.
int (*reflog_delete)(git_refdb_backend *backend, const char *name);
// Lock a ref for transactional updates. Optional (but needed for transactions).
int (*lock)(void **payload_out, git_refdb_backend *backend,
const char *refname);
// Unlock/commit a locked ref. success=1 means update, 2 means delete, 0 means discard.
int (*unlock)(git_refdb_backend *backend, void *payload, int success,
int update_reflog, const git_reference *ref,
const git_signature *sig, const char *message);
};
```
The iterator struct your backend must also implement:
```c
struct git_reference_iterator {
git_refdb *db;
int (*next)(git_reference **ref, git_reference_iterator *iter);
int (*next_name)(const char **ref_name, git_reference_iterator *iter);
void (*free)(git_reference_iterator *iter);
};
```
### Implications for a Postgres Backend
For the ODB, you need to handle:
- Individual object read/write (the core path)
- Prefix lookups (abbreviated SHA-1 resolution) -- this is a `LIKE` or prefix range query on the OID column
- `writepack` for receiving push data -- you parse the incoming packfile stream, resolve deltas, and store the resulting objects individually
- `foreach` for GC and enumeration
For the refdb, you need to handle:
- CRUD on refs (both direct/SHA-1 and symbolic/target-name refs)
- Compare-and-swap semantics on writes (`old` and `old_target` parameters) for atomic updates
- Reflog as an append-only log per ref
- Locking for transactions -- Postgres advisory locks or `SELECT FOR UPDATE` would work
- Glob-based iteration over ref names
The `write` callback in the ODB receives **pre-computed OIDs and decompressed content**. libgit2 handles the `<type> <size>\0` header construction and SHA-1 computation before calling your backend. So your Postgres tables store raw content, object type, and the OID -- you never need to deal with zlib or header formatting on the storage side.
The `writepack` path is the exception. During push, raw packfile bytes arrive via `append`. You need to either: (a) parse the pack yourself, resolve deltas, and insert individual objects, or (b) delegate to libgit2's indexer to do the parsing and then insert objects via the normal `write` path. The libgit2 built-in backends use the indexer approach.
agentId: a8cb904e97eedbf46 (for resuming to continue this agent's work if needed)
<usage>total_tokens: 66665
tool_uses: 33
duration_ms: 598245</usage>