andrew shared

Public

Exploring Aggregated Stats Across Multiple Repositories

Agent

Claude Code

Model

Claude Opus 3

Shared

Feb 6, 5:11 PM

Messages: 71

Downloads: 23

Summary

I'd like to investigate using the new forges module as a way to drive git-pkgs to aggregate stats from across multiple repos, to give similar insights but at an "org" level

Good overview. Let me look at the forges API more closely and the stats command to understand the building blocks.

Read forges.go

Read types.go

Searched files /Users/andrew/code/git-pkgs/forges/*.go

Read stats.go

Let me check if forges already has org-level listing, and look at the GitHub backend and the roadmap.

Read github.go

Read roadmap.md

Read stats.go

Grepped ListRepo|FetchOrg|ListOrg|org|organization

Let me start by understanding what's in this repo and what git-pkgs/forges looks like.

Grepped ListRepo|FetchOrg|ListOrg|org|organization

Searched files /Users/andrew/code/git-pkgs/git-pkgs/internal/database/*.go

Let me check the Stats types and the database query structure.

Grepped type Stats struct|type AuthorStats|type StatsOptions|type NameCount

Grepped func.*GetStats|func.*GetAuthorStats

Read go.mod

Now I have a clear picture. Let me summarize what I've found and think about what the orgs module could look like.

The forges module currently only supports:

- FetchRepository(owner, repo) - single repo metadata
- FetchTags(owner, repo) - tags for a single repo

There's no ListRepositories(org) or similar org-level listing capability. But the underlying libraries all support it:

- go-github has Repositories.ListByOrg
- gitea SDK has ListOrgRepos
- gitlab client-go has Groups.ListGroupProjects
- Bitbucket has similar APIs

So I need to decide whether to add a ListRepositories method to forges or handle it directly in the orgs module, then use that to iterate through each repo, clone it, run git-pkgs indexing, and aggregate the results into org-level statistics. The main challenge is figuring out how to store and organize this aggregated data across multiple repositories.

Looking at what org-level insights would actually be useful: total dependencies across the org, which packages appear in multiple repos, vulnerability exposure, license compliance, ecosystem breakdown, dependency freshness, which repos use specific packages, contributor activity on dependencies, and dependency churn patterns. These insights directly map to the roadmap items around forge integration, license tracking, and impact analysis.

Now I have a clear picture. Here's what I found and where I see this going.

Current state of forges: The module fetches metadata for individual repos (FetchRepository, FetchTags) across GitHub, GitLab, Gitea/Forgejo, and Bitbucket. It has no org-level listing -- it doesn't know how to enumerate repos for an owner/org. But the underlying SDKs (go-github, gitea/sdk, gitlab client-go) all support listing org repos.

The gap: There's no way to go from "give me an org name" to "here are all the repos and their aggregated dependency data." The orgs/ directory is empty, and git-pkgs is entirely per-repo today -- one SQLite database per repository.

What forges would need to add: A ListRepositories(ctx, owner) method on the Forge interface (or a separate OrgForge interface to avoid breaking existing implementations). Each backend would page through the forge API and return []Repository.

What the orgs module could then do:

- Call forges.ListRepositories to get all repos for an org
- Clone (shallow) or open each repo locally
- Run git-pkgs indexing per repo (or read existing databases if already indexed)
- Aggregate into org-level views

The interesting org-level insights, pulling from the roadmap and the existing stats structure:

- Shared dependencies -- which packages appear across multiple repos, and at what versions
- "Which repos use package X" -- the inverse query, already called out in the roadmap
- Vulnerability exposure -- how many repos are affected by a given CVE
- License compliance -- org-wide license inventory, flag repos with problematic licenses
- Ecosystem breakdown -- how many npm vs gem vs pypi dependencies org-wide
- Staleness -- which repos have the most outdated deps
- Dependency churn -- which repos have the most dependency activity
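The first two insights are the same inverse index read two ways. As a rough sketch, assuming dependency rows have already been pulled out of each repo's database into memory, it could be computed like this. The `Dep` struct and both function names are hypothetical and do not reflect the git-pkgs schema.

```go
package main

import (
	"fmt"
	"sort"
)

// Dep is a hypothetical flattened dependency row, not git-pkgs' schema.
type Dep struct {
	Repo, Ecosystem, Name, Version string
}

// reposByPackage builds the inverse index: "ecosystem/name" -> sorted list
// of repos that depend on it. Duplicate rows for a repo are counted once.
func reposByPackage(deps []Dep) map[string][]string {
	seen := map[string]map[string]bool{}
	for _, d := range deps {
		key := d.Ecosystem + "/" + d.Name
		if seen[key] == nil {
			seen[key] = map[string]bool{}
		}
		seen[key][d.Repo] = true
	}
	out := make(map[string][]string, len(seen))
	for key, repos := range seen {
		for r := range repos {
			out[key] = append(out[key], r)
		}
		sort.Strings(out[key])
	}
	return out
}

// shared returns the packages used by at least two repos, sorted by key.
func shared(index map[string][]string) []string {
	var names []string
	for key, repos := range index {
		if len(repos) >= 2 {
			names = append(names, key)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	deps := []Dep{
		{"org/api", "npm", "lodash", "4.17.21"},
		{"org/web", "npm", "lodash", "4.17.20"},
		{"org/web", "npm", "react", "18.2.0"},
	}
	index := reposByPackage(deps)
	for _, name := range shared(index) {
		fmt.Println(name, index[name])
	}
}
```

Looking up `index["npm/lodash"]` answers "which repos use package X", and `shared` answers "which packages appear in more than one repo". Version spread per package would need the version kept alongside each repo entry.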

A few design questions before going further:

Used AskUserQuestion

The user wants:

- Add ListRepositories to the forges module
- Meta DB + per-repo DBs for storage
- CLI tool in this orgs repo

Let me plan this out. The work splits into two parts:

- Changes to forges/ module: add ListRepositories to the Forge interface and implement for each backend
- New orgs/ module: CLI tool that uses forges to list repos, clones/indexes them with git-pkgs, and aggregates stats into a meta DB

Let me plan the implementation properly. This is a non-trivial multi-step task.

Good, that's clear. Two workstreams: extend forges with org listing, then build the orgs CLI here that drives git-pkgs across repos.

Let me plan the implementation.

Used EnterPlanMode

Searched files /Users/andrew/code/git-pkgs/*

Ran sub-agent: Explore forges module patterns

Ran sub-agent: Explore git-pkgs CLI structure

Now I have a thorough understanding of both the forges module and the git-pkgs CLI tool. Let me now design the plan. Let me first check what the GitHub API provides for listing org repos, and look at the other SDKs' org listing capabilities.

Let me now think about the design and write the plan.

Key decisions:

Forges module changes: Add ListRepositories(ctx, owner) returning []Repository to the Forge interface. Each backend implements it with pagination.
Orgs CLI tool: New Go module in this directory that:
- Takes an org name + forge domain
- Uses forges to list repos
- Clones each repo (shallow)
- Runs git-pkgs indexing on each
- Stores per-repo databases in a data directory
- Maintains a meta SQLite DB that tracks repos and aggregated data
- Provides org-level queries

For the CLI commands, I'm thinking git orgs init to bootstrap the org by listing and indexing all repos, git orgs stats for aggregated metrics, git orgs search to find which repos use a specific package, and git orgs vulns to surface vulnerabilities across the organization.

I'd also want git orgs licenses for compliance checking, git orgs list to see what's being tracked, and git orgs sync to refresh the indexes.

For the meta database structure, I really just need a repos table with basic info like the full name, owner, language, and last indexed timestamp. The key insight is that each repo already has its own complete dependency database, so the meta DB doesn't need to duplicate that—it just tracks which repos exist and where to find their databases. Then I can leverage SQLite's ability to attach multiple databases at query time, letting me run cross-repo queries by attaching each repo's database and querying across them all at once.

But there's a practical limit: SQLite caps attached databases at 10 by default (the compile-time SQLITE_MAX_ATTACHED setting), and that cap can be raised to at most 125. For organizations with lots of repos, that won't work. A better approach is to have the meta database pull aggregated snapshots from each repo's database during sync, storing the dependency data alongside a repo reference. That way org-level queries just hit a single database.
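To see why the attach approach scales poorly, a small sketch: with SQLite's default limit of 10 attached databases, a cross-repo query has to be run in rounds and its results merged by the caller. The `batches` helper and the 37-repo workspace layout below are made up for illustration.

```go
package main

import "fmt"

// defaultMaxAttached mirrors SQLite's default SQLITE_MAX_ATTACHED value.
const defaultMaxAttached = 10

// batches splits paths into consecutive groups of at most size entries.
func batches(paths []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(paths); start += size {
		end := start + size
		if end > len(paths) {
			end = len(paths)
		}
		out = append(out, paths[start:end])
	}
	return out
}

func main() {
	// Hypothetical workspace with 37 indexed repos.
	var paths []string
	for i := 0; i < 37; i++ {
		paths = append(paths, fmt.Sprintf("repos/repo-%02d/.git/pkgs.sqlite3", i))
	}
	groups := batches(paths, defaultMaxAttached)
	fmt.Printf("%d databases need %d attach rounds\n", len(paths), len(groups))
}
```

Every org-level query would pay that attach/detach cost and still need merging across rounds, which is what copying snapshots into one meta database avoids.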

Defining the schema...

Now I'm thinking through the sync workflow: for each repo, I'd open its local database, extract the latest dependencies and stats, then upsert them into the meta database. For the directory structure, since this is a CLI tool in the orgs repo, I'm organizing it with a main entry point, command modules for each operation, and internal packages for database and workspace management.

But the user's actual goal is investigatory—they want to explore using the forges module to aggregate stats across multiple repos. So rather than over-engineering the full CLI, I should focus on the core pieces: leveraging the new ListRepositories functionality, building a minimal CLI that can list org repos, clone and index them, and surface aggregated stats. One constraint I'm noticing is that git-pkgs keeps its internals private, so I'd need to either shell out to existing CLI commands, duplicate logic, or work with the public modules like manifests and registries. Shelling out seems like the most pragmatic approach for now.

Actually, there's a better path: I can clone the repos, call git pkgs init on each one to populate their databases, then read the .git/pkgs.sqlite3 files directly since the schema is stable and versioned. This gives me full data access without reimplementing internal code. Now I need to think through the Forge interface change—adding ListRepositories will break all existing implementations, so I should consider a separate interface to avoid that.

I'm defining an OrgForge interface with the ListRepositories method, then having the Client check if a forge implements it before calling the method. But actually, since all four backends can support listing repositories, I should just add it directly to the main interface instead.

The forges module is still in v0.x, so interface changes are expected. Looking at how each SDK handles this: GitHub has separate methods for listing org and user repositories, GitLab uses different endpoints for groups versus users, and Gitea has dedicated methods for org and user repos as well.

For Bitbucket, it's a simple GET request to the repositories endpoint. The tricky part is that the caller shouldn't need to know whether the owner is an organization or user—I should handle that abstraction in the implementation. I could try the org endpoint first and fall back to user on failure, or add an optional OwnerType field to ListOptions to let callers specify. The simplest approach is probably to just attempt both endpoints transparently.

Now I'm thinking through the interface design: should ListRepositories live on the Forge interface taking just the owner, while the Client method takes the full URL and parses it out? That would keep the pattern consistent with how FetchRepository and FetchTags work.
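The Client-side half of that split is mostly URL parsing. A rough sketch, assuming the input looks like `https://github.com/git-pkgs` and that a missing scheme should default to https. `parseOwnerURL` is a hypothetical helper, not something that exists in forges.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseOwnerURL splits a URL such as https://github.com/git-pkgs into the
// forge domain and the owner path. Nested paths (for example GitLab
// subgroups) are returned whole, e.g. "group/subgroup".
func parseOwnerURL(raw string) (domain, owner string, err error) {
	if !strings.Contains(raw, "://") {
		raw = "https://" + raw // assume https when no scheme is given
	}
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", fmt.Errorf("parsing %q: %w", raw, err)
	}
	owner = strings.Trim(u.Path, "/")
	if u.Host == "" || owner == "" {
		return "", "", fmt.Errorf("%q does not name an owner", raw)
	}
	return u.Host, owner, nil
}

func main() {
	for _, raw := range []string{
		"https://github.com/git-pkgs",
		"gitlab.com/group/subgroup",
		"https://github.com",
	} {
		domain, owner, err := parseOwnerURL(raw)
		fmt.Printf("%q -> domain=%q owner=%q err=%v\n", raw, domain, owner, err)
	}
}
```

The domain would pick the backend and the owner would be passed to that backend's ListRepositories, mirroring how the URL-taking Client methods relate to the per-forge ones.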

For the CLI itself, I'll start with a focused two-phase approach: first add the ListRepositories method to the Forge interface and implement it across all four backends with tests, then build out the orgs CLI command.

The CLI will need a meta database to track repositories and dependency snapshots, with commands to initialize by listing repos from the forges and cloning them, gather organization-level dependency statistics, search for which repos use specific packages, and list all tracked repositories.

Now let me design the implementation.

Ran sub-agent: Design orgs implementation plan

Let me verify a few details about the pkgs database schema and the dependency query before finalizing the plan.

Read queries.go (80 lines)

Read schema.go

Good. I have everything I need. Let me also check one thing -- the SchemaVersion constant.

Grepped SchemaVersion\s*=

Now let me write the plan.

Used ExitPlanMode

Ran sub-agent: Explore git-pkgs codebase