📄 Khan et al. (2025) MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use (arXiv:2502.15872v1 [cs.CL])
Part 4 of a series on MutaGReP — a strategic approach to library-scale code generation.
In this section, we follow the pipeline that scans through a large codebase, identifies its functions and classes, and converts them into searchable embeddings. These embeddings live in a vector database, letting the plan-based approach quickly pinpoint the relevant symbols for each step. By selectively pulling only what’s needed from the repository, it neatly avoids the pitfalls of naive, all-inclusive contexts.
9: Code Symbol Mining with AST Analysis
Before MutaGReP can perform any plan search, it needs to know what symbols exist in the repository. The repository could have hundreds or thousands of functions, classes, constants, etc. The process of extracting these symbols is referred to as symbol mining. In the implementation, this is handled by the mutagrep.coderec.v3.symbol_mining module (as referenced in the example command).
For Python repositories (which LongCodeArena tasks use), MutaGReP's symbol mining implementation uses Python's ast module to walk the abstract syntax tree of each file. The SymbolExtractor class (which extends ast.NodeVisitor) traverses the AST and identifies all class definitions (visit_ClassDef), function definitions (visit_FunctionDef), and method definitions (functions inside classes). For each symbol, it extracts:
- The symbol name
- Full path (qualified name including module and parent classes)
- Docstring using ast.get_docstring(node)
- Complete code using ast.get_source_segment
- File location information (filename, filepath, and line number)
- Symbol type (function, method, or class) using the SymbolCategory enum
The system creates a structured Symbol object for each definition, represented as a Pydantic model with these fields. For example, if the repository has a file utils.py with def compute_class_weights(labels): ... and a class Model with method train(), the symbol mining captures:
- Symbol: compute_class_weights with full path utils.compute_class_weights and type FUNCTION
- Symbol: Model.train with full path utils.Model.train and type METHOD
- Symbol: Model with full path utils.Model and type CLASS
The module provides the extract_all_symbols_under_directory function, which:
1. Finds all Python files in the directory using get_all_pyfiles_under_directory
2. For each file, calls extract_symbols_from_file, which:
- Reads the file content
- Parses it with ast.parse
- Creates a SymbolExtractor instance and runs it
- Returns the list of symbols found
The system skips files with syntax errors (detecting them with a try-except around ast.parse) and records the errors using the loguru logger. Beyond basic symbol extraction, the module contains several further analysis functions:
- count_symbol_usage_frequency: Analyzes code to count how often each symbol is used
- count_symbol_references: Specifically counts call references to each symbol
- build_call_graph: Creates a directed graph (using networkx) of function calls
- determine_symbol_degree_from_call_graph: Identifies entry points and connectivity
- compute_rankable_symbols: Combines multiple metrics to rank symbols by importance
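As a rough sketch of how such a call graph can be assembled (a simplification, not the module's actual build_call_graph logic; it only links top-level functions to plain-name calls):

```python
import ast
import networkx as nx

def build_simple_call_graph(source: str, module_name: str) -> nx.DiGraph:
    """Simplified call graph: an edge from each function definition
    to every plain-name call appearing in its body."""
    graph = nx.DiGraph()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            caller = f"{module_name}.{node.name}"
            graph.add_node(caller)
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    graph.add_edge(caller, inner.func.id)
    return graph

graph = build_simple_call_graph(open("utils.py").read(), "utils")
# Nodes with in-degree 0 are candidate entry points; out-degree hints at
# orchestration functions -- the kind of connectivity signal the degree
# analysis draws on.
print([n for n, deg in graph.in_degree() if deg == 0])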
The output of this process is a list of Symbol objects that can be serialized to JSON. The code includes utilities to extract clean signatures from symbols in extract_symbol_signature, which formats function arguments, return types, and the first sentence of each docstring into a readable format.
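A minimal version of that signature formatting might look like the following sketch; the exact output format of extract_symbol_signature is an assumption here:

```python
import ast

def simple_signature(node: ast.FunctionDef) -> str:
    """Format 'name(args) -> ret: first docstring sentence'."""
    args = ", ".join(a.arg for a in node.args.args)
    ret = f" -> {ast.unparse(node.returns)}" if node.returns else ""
    doc = ast.get_docstring(node) or ""
    first_sentence = doc.split(".")[0].strip()
    suffix = f": {first_sentence}" if first_sentence else ""
    return f"{node.name}({args}){ret}{suffix}"

tree = ast.parse(
    "def add(a: int, b: int) -> int:\n    'Add two ints. Simple.'\n    return a + b"
)
print(simple_signature(tree.body[0]))  # add(a, b) -> int: Add two ints
```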
Rather than relying only on docstrings (which are often inconsistent or missing), MutaGReP uses this symbol data as input to a subsequent synthetic intent generation process, where an LLM produces more uniform natural language descriptions of each symbol.
In summary, MutaGReP's symbol mining via AST analysis is the foundational step that transforms a code repository into a structured knowledge base of symbols. By extracting comprehensive metadata about each symbol, the implementation builds a bridge between raw code and the semantic layer needed for planning and retrieval.
10: Generating Synthetic Intents for Symbols
Once the system has a list of symbols from the repository, the next step is to create natural language intents for each symbol. A synthetic intent in this context is a sentence or phrase that describes a plausible use-case or functionality of that symbol, phrased the way a user might query it. For example, for a function setupGun (from Appendix A's DD4hep example), a synthetic intent might be "I want to configure the particle gun with specific parameters such as name, type, and energy level." These intents serve as semantic descriptors of the code symbols.
MutaGReP implements this process in the mutagrep.coderec.v3.intent_generation module, which defines the OpenAIIntentGenerator class. This class handles intent generation with the following key components:
- The Intent and IntentForSymbol Pydantic models define the structure of generated intents and their relationship to symbols
- The generate_intents_for_symbols function processes a collection of symbols, generating intents for each one
- A specific prompt template (intent_generation_template) instructs the LLM on how to generate high-quality intents
The actual intent generation uses OpenAI's GPT-4o-mini model by default, as specified in the constructor:
def __init__(self, model: str = "gpt-4o-mini", num_intents_per_symbol: int = 5):
self.model = model
self.num_intents_per_symbol = num_intents_per_symbol
self.client = instructor.from_openai(OpenAI())
The system generates five intents per symbol by default, providing multiple semantic angles for each code entity. The prompt template specifically instructs the LLM to write first-person intents as if a user is expressing their goal:
You will be given the fully qualified name of the symbol in the codebase, and the name of the codebase itself.
You will be asked to generate a list of intents.
Each intent should be written in the first person, as if you are giving the intent to the junior developer.
Each intent should be unique and different from the other intents.
The prompt includes contextual information about the symbol being processed:
- The repository name
- The symbol's fully qualified path
- The symbol's type (function, method, or class)
- The symbol's code (truncated to 10,000 tokens if necessary)
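As a rough illustration of how such a structured-output call might look with the instructor library (the IntentList wrapper, message wording, and prompt wiring are assumptions, not the repository's actual template):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Intent(BaseModel):
    intent: str  # first-person description of a plausible use-case

class IntentList(BaseModel):  # hypothetical wrapper for structured output
    intents: list[Intent]

client = instructor.from_openai(OpenAI())

def generate_intents(symbol_path: str, symbol_code: str, repo_name: str,
                     n: int = 5, model: str = "gpt-4o-mini") -> list[Intent]:
    """Ask the LLM for n unique first-person intents describing the symbol."""
    result = client.chat.completions.create(
        model=model,
        response_model=IntentList,  # instructor validates against this model
        messages=[{
            "role": "user",
            "content": (
                f"Repository: {repo_name}\n"
                f"Symbol: {symbol_path}\n"
                f"Code:\n{symbol_code}\n\n"
                f"Write {n} unique first-person intents a user might have "
                f"that this symbol could satisfy."
            ),
        }],
    )
    return result.intents
```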
The implementation handles large repositories efficiently through:
1. Parallel processing with ThreadPoolExecutor (configurable with max_workers)
2. Idempotent operation that skips already processed symbols
3. Filtering out test-related symbols using the is_test_heuristic function
4. Retrying failed API calls with exponential backoff
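Items 1, 2, and 4 combine naturally into a pattern like the following sketch (the generate_intents_for placeholder and the full_path field are assumptions, not MutaGReP's actual code):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_intents_for(symbol):
    """Placeholder for the real LLM call (see the instructor sketch above)."""
    raise NotImplementedError

def with_backoff(fn, max_attempts: int = 5):
    """Retry fn with exponential backoff plus jitter on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt + random.random())

def process_symbols(symbols, already_done: set[str], max_workers: int = 8):
    """Fan symbols out to a thread pool, skipping ones processed earlier."""
    todo = [s for s in symbols if s.full_path not in already_done]  # idempotency
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(with_backoff, lambda s=s: generate_intents_for(s)): s
                   for s in todo}
        for future in as_completed(futures):
            yield futures[future], future.result()
```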
The processed intents are saved in a JSONL file (one per repository) using the PydanticJSONLinesWriter. This enables the system to reuse previously generated intents on subsequent runs.
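A writer along these lines is straightforward to picture; this sketch assumes the shape of PydanticJSONLinesWriter rather than reproducing it:

```python
from pathlib import Path
from pydantic import BaseModel

class JSONLinesWriter:
    """Append Pydantic models to a .jsonl file, one JSON object per line."""
    def __init__(self, path: str):
        self.path = Path(path)

    def write(self, record: BaseModel) -> None:
        with self.path.open("a") as f:
            f.write(record.model_dump_json() + "\n")

# On a later run, previously written lines can be re-read and the matching
# symbols skipped -- which is what makes the pipeline idempotent.
```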
From Appendix A (Table 4), we learn that during plan search, when a plan step has an intent, the system finds synthetic intents closest to that intent via embedding similarity, and retrieves the corresponding symbols. The table shows examples of top-3 closest synthetic intents to a given query intent.
The synthetic intent generation approach offers several advantages:
- It creates a consistent interface between user queries and code symbols
- It generates homogeneous descriptions, unlike the variable quality of docstrings
- It captures diverse use cases not explicitly mentioned in code documentation
- It reduces false matches by incorporating contextual understanding of symbols
In summary, synthetic intent generation transforms raw code symbols into natural language descriptions that align with how users express their needs. This is a critical bridge between human language and code, enabling the vector search mechanisms to operate effectively.
11: Vector Database Integration (Qdrant and LanceDB)
MutaGReP's implementation is designed to be flexible in how it stores and queries embeddings for code symbols. It specifically integrates with Qdrant and LanceDB as the vector database backends, ensuring that vector search is both scalable and easy to use locally. These tools play a behind-the-scenes role in the symbol retrieval process described earlier.
The integration is implemented in vector_search.py, which defines:
- Protocols and abstract interfaces like ObjectVectorDatabase and Embedder
- Concrete implementations for both vector database backends
- Generic typing to ensure type safety across different model types
OpenAI Embedding Integration
The system uses OpenAI's embedding model by default through the OpenAIEmbedder class, which provides:
def __init__(self, embedding_model="text-embedding-ada-002", batch_size=10):
self.embedding_model = embedding_model
self.batch_size = batch_size
self.client = OpenAI()
This embedder batches requests efficiently (10 items per batch by default) and exposes the embedding dimension through a property method. The implementation handles embedding extraction and order validation:
def batch_embed_openai(self, docs_to_embed: list[str], batch_size: int = 10,
embedding_model="text-embedding-ada-002") -> list[list[float]]:
# ... batching logic ...
response = self.client.embeddings.create(
model=embedding_model,
input=batch,
)
# Double check embeddings are in same order as input
for i, be in enumerate(response.data):
assert i == be.index
batch_embeddings = [e.embedding for e in response.data]
return batch_embeddings
Qdrant Integration
Qdrant is a high-performance vector similarity search engine that can run as a separate service (with a REST/gRPC API) (Getting Started with Qdrant: A Beginner's Guide to Vector Search). It is built to handle large volumes of vectors and supports filtering, payloads, etc. In MutaGReP, Qdrant is suitable if the repository is huge or if one wants to deploy the system in a server environment. The QDrantVectorDatabase class provides a clean interface to Qdrant:
def __init__(self, embedder: Optional[Embedder] = None, vector_database_url: str = ":memory:"):
self.vector_database_url = vector_database_url
self.client = QdrantClient(self.vector_database_url)
self.collection_names: set[str] = set()
self.embedder = embedder or OpenAIEmbedder()
It handles:
1. Creating collections with the appropriate vector dimension and distance metric (cosine)
2. Embedding and inserting documents with their payloads
3. Searching for nearest neighbors based on query embedding
The implementation:
- Connects to a Qdrant instance (defaulting to ":memory:" if no URL is provided)
- Creates collections with the appropriate vector dimension using recreate_collection
- Stores each document's embedding along with its payload
- Queries the collection for nearest neighbors given a new embedding
Qdrant is known for being reliable and having features like sharding and filtering. In this context, it is arguably more power than needed for a single-user local run on one repo, but it suits research or larger deployments.
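To ground those steps, here is a minimal sketch of the underlying qdrant-client calls; the collection name and dummy vectors are illustrative (1536 is the dimension of text-embedding-ada-002 embeddings):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# In-memory instance, mirroring the ":memory:" default described above.
client = QdrantClient(":memory:")

client.recreate_collection(
    collection_name="symbols",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

client.upsert(
    collection_name="symbols",
    points=[PointStruct(id=0, vector=[0.0] * 1536,
                        payload={"full_path": "utils.compute_class_weights"})],
)

hits = client.search(collection_name="symbols",
                     query_vector=[0.0] * 1536, limit=5)
for hit in hits:
    print(hit.score, hit.payload["full_path"])
```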
The system also provides a Pydantic-friendly wrapper, PydanticQdrantVectorDatabase, that converts between Pydantic models and Qdrant's data structures, maintaining type safety throughout.
LanceDB Integration
LanceDB is a lightweight library that allows you to store vectors in an optimized format (Apache Arrow / Parquet) and query them without a separate server (Pydantic - LanceDB). It's basically an embedding database you can use within a Python script, and its integration with Pydantic makes for a clean API design. The PydanticLancedbVectorDatabase class implements this integration:
def __init__(self, database_url: str, model: Type[BaseModelT],
embedder: Optional[Embedder] = None, table_name: str = "default.lancedb"):
self.database_url = database_url
self.model = model
self.embedder = embedder or OpenAIEmbedder()
self.db = lancedb.connect(self.database_url)
self.table_name = table_name
The implementation:
- Creates or opens a LanceDB table for the embeddings
- Inserts records for each symbol's embedding
- Connects to the database with lancedb.connect(self.database_url)
- Queries the LanceDB index by calling search(embedding).limit(limit)
The implementation takes advantage of LanceDB's persistence capabilities. When inserting documents, it either adds to an existing table or creates a new one:
def insert(self, docs: Sequence[Embeddable[BaseModelT]]) -> Any:
embeddings = self.embedder([doc.key for doc in docs])
data = [
{"vector": embedding, "payload": json.loads(doc.model_dump_json())}
for embedding, doc in zip(embeddings, docs)
]
try:
tbl = self.db.open_table(self.table_name)
tbl.add(data)
except FileNotFoundError:
tbl = self.db.create_table(self.table_name, data=data)
For searching, it embeds the query, performs the search, and returns properly typed results:
def search(self, query: str, limit: int = 10) -> Sequence[RetrievedEmbeddable[BaseModelT]]:
embedding = self.embedder([query])[0]
results = self.db[self.table_name].search(embedding).limit(limit).to_list()
# ... converts results to RetrievedEmbeddable instances ...
LanceDB can persist the data to a file, ensuring subsequent runs don't need re-embedding if the file is saved. The codebase explicitly indicates that PydanticLancedbVectorDatabase is the primary implementation with:
implements(ObjectVectorDatabase)(PydanticLancedbVectorDatabase)
Structured Data Models
Pydantic's role is to structure the data that goes into these databases. The system uses several Pydantic models to provide structured representations:
- Embeddable[BaseModelT] - Wraps an item to be embedded with its key:

```python
class Embeddable(BaseModel, Generic[BaseModelT]):
    key: str
    payload: BaseModelT
```
- RetrievedEmbeddable[BaseModelT] - Represents a search result with score and payload:

```python
class RetrievedEmbeddable(BaseModel, Generic[BaseModelT]):
    score: float
    payload: BaseModelT = Field(default_factory=lambda: NullModel())
    score_type: SymbolRetrievalScoreType = SymbolRetrievalScoreType.SIMILARITY
```
This integration means results from database queries are automatically cast to the appropriate Pydantic model types, so you get structured Python objects directly rather than raw dictionaries.
Symbol Retrieval Implementation
MutaGReP provides two concrete symbol retrievers in the plan_search/symbol_retrievers/ directory:
- OpenAiVectorSearchSymbolRetriever - Uses OpenAI embeddings with vector search:

```python
def __call__(self, queries: Sequence[str], n_results: int = 5) -> Sequence[RetrievedSymbol]:
    retrieved_embeddables: list[RetrievedEmbeddable[Symbol]] = []
    for query in queries:
        retrieved_embeddables.extend(self.vector_database.search(query, n_results))
    # ... converts results to RetrievedSymbol instances ...
```
- Bm25SymbolRetriever - Uses the BM25 algorithm for keyword-based retrieval as an alternative:

```python
def __call__(self, queries: Sequence[str], n_results: int = 5) -> Sequence[RetrievedSymbol]:
    # Ignore any queries that are less than or equal to 2 characters
    queries = [query for query in queries if len(query) > 2]
    query_tokens = self.tokenizer.tokenize(list(queries))
    results, scores = self.text_retriever.retrieve(query_tokens, k=n_results)
    # ... processes results ...
```
Both implement the common SymbolRetriever protocol from domain_models.py, which specifies:
class SymbolRetriever(Protocol):
def __call__(self, queries: Sequence[str], n_results: int = 5) -> Sequence[RetrievedSymbol]: ...
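Since SymbolRetriever is a typing.Protocol, conformance is structural: any callable with a matching signature counts. The ChainedRetriever below is not part of MutaGReP; it is a hypothetical illustration of composing retrievers behind the same interface:

```python
from typing import Protocol, Sequence

class SymbolRetriever(Protocol):  # as defined in domain_models.py
    def __call__(self, queries: Sequence[str], n_results: int = 5) -> Sequence["RetrievedSymbol"]: ...

class ChainedRetriever:
    """Queries several retrievers and concatenates their results.
    Satisfies SymbolRetriever structurally -- no subclassing needed."""
    def __init__(self, retrievers: Sequence[SymbolRetriever]):
        self.retrievers = retrievers

    def __call__(self, queries: Sequence[str], n_results: int = 5):
        results = []
        for retriever in self.retrievers:
            results.extend(retriever(queries, n_results=n_results))
        return results

# e.g. ChainedRetriever([vector_retriever, bm25_retriever]) would merge
# semantic and keyword hits behind the single SymbolRetriever interface.
```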
Code Search Tools
The search tools in plan_search/code_search_tools/ leverage these symbol retrievers:
- DirectIntentSearchTool - Directly maps intents to symbols:

```python
def __call__(self, intention: str) -> CodeSearchToolOutput:
    retrieved_symbols = self.symbol_retriever(
        queries=[intention], n_results=self.symbols_to_retrieve
    )
    # ... creates output with retrieved symbols ...
```
- NoDuplicatesDirectIntentSearchTool - Adds deduplication logic (see the sketch after this list):

```python
def __call__(self, intention: str) -> CodeSearchToolOutput:
    num_symbols_to_retrieve = self.symbols_to_retrieve * self.overretrieve_factor
    retrieved_symbols = self.symbol_retriever(
        queries=[intention], n_results=num_symbols_to_retrieve
    )
    # ... creates output and deduplicates ...
```
- OpenAiOneStepCodeSearchTool - A more sophisticated approach that uses an LLM to:
  - Generate keywords from the intent
  - Retrieve symbols using those keywords
  - Filter and rank the symbols
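The over-retrieve-then-deduplicate pattern in NoDuplicatesDirectIntentSearchTool can be sketched as follows; this is a simplification that assumes each hit exposes a symbol with a qualified path:

```python
def dedupe_symbols(retrieved, n_results: int):
    """Keep the first (highest-ranked) hit per symbol path, up to n_results."""
    seen: set[str] = set()
    unique = []
    for hit in retrieved:  # over-fetched, e.g. n_results * overretrieve_factor
        path = hit.symbol.full_path  # assumed field layout
        if path not in seen:
            seen.add(path)
            unique.append(hit)
        if len(unique) == n_results:
            break
    return unique
```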
The inclusion of both Qdrant and LanceDB accommodates different usage scenarios. This dual integration ensures that regardless of context (offline research experiments or an integrated development environment), the system can perform fast vector lookups for symbols. It also future-proofs the implementation: if a new vector DB emerges, it could be integrated similarly through the common interfaces.
The important point is that MutaGReP's vector search component is not a black box but built on well-known tools:
- Qdrant provides a robust, production-grade similarity search (5 Minute RAG with Qdrant and DeepSeek)
- LanceDB provides an easy local solution with Pydantic integration (Pydantic - LanceDB)
In summary, MutaGReP's vector database integration provides a flexible, type-safe foundation for semantic code search. The system can store and retrieve symbol embeddings using either Qdrant (for production-grade scalability) or LanceDB (for simple local usage). The architecture follows a clean separation of concerns with protocols defining interfaces and concrete implementations providing specific behaviors. This design makes the system adaptable to different usage scenarios while maintaining consistent behavior across different vector database backends.