Mining the Repository and Embedding Its Symbols

Where code parsing meets vector databases to power retrieval

MutaGReP

📄 Khan et al. (2025) MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use (arXiv:2502.15872v1 [cs.CL])

Part 4 of a series on MutaGReP — a strategic approach to library-scale code generation.

In this section, we follow the pipeline that scans through a large codebase, identifies its functions and classes, and converts them into searchable embeddings. These embeddings live in a vector database, letting the plan-based approach quickly pinpoint the relevant symbols for each step. By selectively pulling only what’s needed from the repository, it neatly avoids the pitfalls of naive, all-inclusive contexts.

9: Code Symbol Mining with AST Analysis

Before MutaGReP can perform any plan search, it needs to know what symbols exist in the repository. The repository could have hundreds or thousands of functions, classes, constants, etc. The process of extracting these symbols is referred to as symbol mining. In the implementation, this is handled by the mutagrep.coderec.v3.symbol_mining module (as referenced in the example command).

For Python repositories (which LongCodeArena tasks use), MutaGReP's symbol mining implementation uses Python's ast module to walk the abstract syntax tree of each file. The SymbolExtractor class (which extends ast.NodeVisitor) traverses the AST and identifies all class definitions (visit_ClassDef), function definitions (visit_FunctionDef), and method definitions (functions inside classes). For each symbol, it extracts metadata such as the symbol's name, its fully qualified path within the repository, its type (function, method, or class), its source code, and its docstring.

The system creates a structured Symbol object for each definition, represented as a Pydantic model with these fields. For example, if the repository has a file utils.py with def compute_class_weights(labels): ... and a class Model with method train(), the symbol mining captures three symbols: utils.compute_class_weights (a function), utils.Model (a class), and utils.Model.train (a method).

The module provides the extract_all_symbols_under_directory function, which:

  1. Finds all Python files in the directory using get_all_pyfiles_under_directory
  2. For each file, calls extract_symbols_from_file, which:
     - Reads the file content
     - Parses it with ast.parse
     - Creates a SymbolExtractor instance and runs it
     - Returns the list of symbols found
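The overall shape of this walk can be sketched as follows. This is a minimal reconstruction, not the repo's code: the plain dicts and their exact field names stand in for the real Symbol model.

import ast
from pathlib import Path

class SymbolExtractor(ast.NodeVisitor):
    """Collects class, function, and method definitions from one parsed module."""

    def __init__(self, module_name: str) -> None:
        self.module_name = module_name
        self.symbols: list[dict] = []
        self._class_stack: list[str] = []

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        self._record(node, "class")
        self._class_stack.append(node.name)
        self.generic_visit(node)  # descend so methods are visited with class context
        self._class_stack.pop()

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
        # Nested functions are not descended into in this sketch
        kind = "method" if self._class_stack else "function"
        self._record(node, kind)

    def _record(self, node, kind: str) -> None:
        full_path = ".".join([self.module_name, *self._class_stack, node.name])
        self.symbols.append({
            "name": node.name,
            "full_path": full_path,
            "symbol_type": kind,
            "docstring": ast.get_docstring(node),
            "lineno": node.lineno,
        })

def extract_symbols_from_file(path: Path) -> list[dict]:
    try:
        tree = ast.parse(path.read_text())
    except SyntaxError:
        return []  # unparseable files are skipped, as described below
    extractor = SymbolExtractor(path.stem)
    extractor.visit(tree)
    return extractor.symbols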

The system skips files with syntax errors (detecting them with a try-except around ast.parse) and records the errors using the loguru logger. Beyond this basic extraction, the module contains several additional analysis functions, including the signature utility described below.

The output of this process is a list of Symbol objects that can be serialized to JSON. Among the extra utilities, extract_symbol_signature produces clean signatures from symbols, formatting function arguments, return types, and the first sentence of each docstring into a readable form.
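A signature formatter along these lines might look like the sketch below; the name format_signature and the exact output layout are illustrative, not the repo's implementation.

import ast

def format_signature(node: ast.FunctionDef) -> str:
    """Render a one-line signature: args, return annotation, first docstring sentence."""
    args = ", ".join(arg.arg for arg in node.args.args)
    returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
    doc = ast.get_docstring(node)
    summary = f"  # {doc.split('.')[0]}." if doc else ""
    return f"def {node.name}({args}){returns}{summary}"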

Rather than relying only on docstrings (which are often inconsistent or missing), MutaGReP uses this symbol data as input to a subsequent synthetic intent generation process, where an LLM produces more uniform natural language descriptions of each symbol.

In summary, MutaGReP's symbol mining via AST analysis is the foundational step that transforms a code repository into a structured knowledge base of symbols. By extracting comprehensive metadata about each symbol, the implementation builds a bridge between raw code and the semantic layer needed for planning and retrieval.

10: Generating Synthetic Intents for Symbols

Once the system has a list of symbols from the repository, the next step is to create natural language intents for each symbol. A synthetic intent in this context is a sentence or phrase that describes a plausible use-case or functionality of that symbol, phrased the way a user might query it. For example, for a function setupGun (from Appendix A's DD4hep example), a synthetic intent might be "I want to configure the particle gun with specific parameters such as name, type, and energy level." These intents serve as semantic descriptors of the code symbols.

MutaGReP implements this process in the mutagrep.coderec.v3.intent_generation module, which defines the OpenAIIntentGenerator class. This class handles intent generation with the following key components:

  1. The Intent and IntentForSymbol Pydantic models define the structure of generated intents and their relationship to symbols
  2. The generate_intents_for_symbols function processes a collection of symbols, generating intents for each one
  3. A specific prompt template (intent_generation_template) instructs the LLM on how to generate high-quality intents

The actual intent generation uses OpenAI's GPT-4o-mini model by default, as specified in the constructor:

def __init__(self, model: str = "gpt-4o-mini", num_intents_per_symbol: int = 5):
    self.model = model
    self.num_intents_per_symbol = num_intents_per_symbol
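    # instructor wraps the OpenAI client so completions can return validated Pydantic models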
    self.client = instructor.from_openai(OpenAI())

The system generates five intents per symbol by default, providing multiple semantic angles for each code entity. The prompt template specifically instructs the LLM to write first-person intents as if a user is expressing their goal:

You will be given the fully qualified name of the symbol in the codebase, and the name of the codebase itself.
You will be asked to generate a list of intents. 
Each intent should be written in the first person, as if you are giving the intent to the junior developer.
Each intent should be unique and different from the other intents.

The prompt includes contextual information about the symbol being processed:

  - The repository name
  - The symbol's fully qualified path
  - The symbol's type (function, method, or class)
  - The symbol's code (truncated to 10,000 tokens if necessary)
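Putting these pieces together, a single structured-output call via instructor might look like this sketch; the IntentList response model and the message wording are assumptions, not the repo's exact prompt.

import instructor
from openai import OpenAI
from pydantic import BaseModel

class IntentList(BaseModel):
    intents: list[str]

client = instructor.from_openai(OpenAI())

def generate_intents(repo_name: str, symbol_path: str, symbol_type: str,
                     code: str, n: int = 5, model: str = "gpt-4o-mini") -> IntentList:
    # instructor validates the completion against the response_model schema
    return client.chat.completions.create(
        model=model,
        response_model=IntentList,
        messages=[
            {"role": "system",
             "content": f"Generate {n} unique first-person intents a user might have."},
            {"role": "user",
             "content": f"Codebase: {repo_name}\nSymbol: {symbol_path} ({symbol_type})\n{code}"},
        ],
    )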

The implementation handles large repositories efficiently through:

  1. Parallel processing with ThreadPoolExecutor (configurable via max_workers)
  2. Idempotent operation that skips already-processed symbols
  3. Filtering out test-related symbols using the is_test_heuristic function
  4. Retrying failed API calls with exponential backoff
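A sketch of that outer loop, reusing the generate_intents sketch above; tenacity stands in here for whatever backoff mechanism the repo actually uses, and the symbol attributes are assumed names.

from concurrent.futures import ThreadPoolExecutor, as_completed
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(min=1, max=30), stop=stop_after_attempt(5))
def generate_with_retry(symbol) -> IntentList:
    return generate_intents(symbol.repo_name, symbol.full_path,
                            symbol.symbol_type, symbol.code)

def process_symbols(symbols, already_done: set[str], max_workers: int = 8):
    # Idempotent: skip symbols whose intents were generated on a previous run
    todo = [s for s in symbols if s.full_path not in already_done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(generate_with_retry, s): s for s in todo}
        for future in as_completed(futures):
            yield futures[future], future.result()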

The processed intents are saved in a JSONL file (one per repository) using the PydanticJSONLinesWriter. This enables the system to reuse previously generated intents on subsequent runs.
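The writer itself need not be more than a few lines; a minimal equivalent (not the repo's exact class) could be:

from pathlib import Path
from pydantic import BaseModel

class JSONLinesWriter:
    """Appends one JSON-serialized Pydantic model per line, so runs can resume."""

    def __init__(self, path: str | Path) -> None:
        self.path = Path(path)

    def __call__(self, record: BaseModel) -> None:
        with self.path.open("a", encoding="utf-8") as f:
            f.write(record.model_dump_json() + "\n")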

From Appendix A (Table 4), we learn that during plan search, when a plan step has an intent, the system finds synthetic intents closest to that intent via embedding similarity, and retrieves the corresponding symbols. The table shows examples of top-3 closest synthetic intents to a given query intent.

The synthetic intent generation approach offers several advantages:

  - It creates a consistent interface between user queries and code symbols
  - It generates homogeneous descriptions, unlike the variable quality of docstrings
  - It captures diverse use cases not explicitly mentioned in code documentation
  - It reduces false matches by incorporating contextual understanding of symbols

In summary, synthetic intent generation transforms raw code symbols into natural language descriptions that align with how users express their needs. This is a critical bridge between human language and code, enabling the vector search mechanisms to operate effectively.

11: Vector Database Integration (Qdrant and LanceDB)

MutaGReP's implementation is designed to be flexible in how it stores and queries embeddings for code symbols. It specifically integrates with Qdrant and LanceDB as the vector database backends, ensuring that vector search is both scalable and easy to use locally. These tools play a behind-the-scenes role in the symbol retrieval process described earlier.

The integration is implemented in vector_search.py, which defines:

  - Protocols and abstract interfaces like ObjectVectorDatabase and Embedder
  - Concrete implementations for both vector database backends
  - Generic typing to ensure type safety across different model types

OpenAI Embedding Integration

The system uses OpenAI's embedding model by default through the OpenAIEmbedder class, which provides:

def __init__(self, embedding_model="text-embedding-ada-002", batch_size=10):
    self.embedding_model = embedding_model
    self.batch_size = batch_size
    self.client = OpenAI()

This embedder batches requests efficiently (10 items per batch by default) and exposes the embedding dimension through a property method. The implementation handles embedding extraction and order validation; the outer batching loop below is sketched in where the source elides it:

def batch_embed_openai(self, docs_to_embed: list[str], batch_size: int = 10,
                       embedding_model="text-embedding-ada-002") -> list[list[float]]:
    all_embeddings: list[list[float]] = []
    # Walk the documents in fixed-size batches (loop reconstructed from the elided logic)
    for start in range(0, len(docs_to_embed), batch_size):
        batch = docs_to_embed[start : start + batch_size]
        response = self.client.embeddings.create(
            model=embedding_model,
            input=batch,
        )
        # Double check embeddings are in same order as input
        for i, be in enumerate(response.data):
            assert i == be.index
        batch_embeddings = [e.embedding for e in response.data]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings

Qdrant Integration

Qdrant is a high-performance vector similarity search engine that can run as a separate service with a REST/gRPC API. It is built to handle large volumes of vectors and supports filtering, payloads, and more. In MutaGReP, Qdrant is the natural choice when the repository is huge or when the system is deployed in a server environment. The QDrantVectorDatabase class provides a clean interface to Qdrant:

def __init__(self, embedder: Optional[Embedder] = None, vector_database_url: str = ":memory:"):
    self.vector_database_url = vector_database_url
    self.client = QdrantClient(self.vector_database_url)
    self.collection_names: set[str] = set()
    self.embedder = embedder or OpenAIEmbedder()

It handles:

  1. Creating collections with the appropriate vector dimension and distance metric (cosine)
  2. Embedding and inserting documents with their payloads
  3. Searching for nearest neighbors based on a query embedding

The implementation:

  - Connects to a Qdrant instance (defaulting to ":memory:" if no URL is provided)
  - Creates collections with the appropriate vector dimension using recreate_collection
  - Stores each document's embedding along with its payload
  - Queries the collection for nearest neighbors given a new embedding
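Stripped of the wrapper, the underlying qdrant-client calls look roughly like this; the collection name and payload contents are illustrative.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # in-process instance; pass a URL for a server

client.recreate_collection(
    collection_name="symbols",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),  # ada-002 dimension
)

client.upsert(
    collection_name="symbols",
    points=[PointStruct(id=0, vector=[0.0] * 1536,
                        payload={"full_path": "utils.compute_class_weights"})],
)

hits = client.search(collection_name="symbols", query_vector=[0.0] * 1536, limit=5)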

Qdrant is known for reliability and for features like sharding and filtering. In this context it is arguably more power than a single-user local run on one repository needs, but it suits research workloads and larger deployments.

The system also provides a Pydantic-friendly wrapper PydanticQdrantVectorDatabase that converts between Pydantic models and Qdrant's data structures, maintaining type safety throughout.

LanceDB Integration

LanceDB is a lightweight library that stores vectors in an optimized format (Apache Arrow / Parquet) and queries them without a separate server. It is essentially an embedding database you can use inside a Python script, and its integration with Pydantic makes for a clean API design. The PydanticLancedbVectorDatabase class implements this integration:

def __init__(self, database_url: str, model: Type[BaseModelT], 
             embedder: Optional[Embedder] = None, table_name: str = "default.lancedb"):
    self.database_url = database_url
    self.model = model
    self.embedder = embedder or OpenAIEmbedder()
    self.db = lancedb.connect(self.database_url)
    self.table_name = table_name

The implementation:

  - Connects to the database with lancedb.connect(self.database_url)
  - Creates or opens a LanceDB table for the embeddings
  - Inserts records for each symbol's embedding
  - Queries the LanceDB index by calling search(embedding).limit(limit)
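At the LanceDB level, the same flow reduces to a handful of calls; the path, table contents, and vector dimension here are illustrative.

import lancedb

db = lancedb.connect("data/symbol-index")  # a directory on disk, created on demand
table = db.create_table(
    "default.lancedb",
    data=[{"vector": [0.1, 0.2, 0.3], "payload": {"full_path": "utils.Model.train"}}],
)
rows = table.search([0.1, 0.2, 0.3]).limit(5).to_list()  # rows carry payload plus _distance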

The implementation takes advantage of LanceDB's persistence capabilities. When inserting documents, it either adds to an existing table or creates a new one:

def insert(self, docs: Sequence[Embeddable[BaseModelT]]) -> Any:
    # Embed each document's key (e.g. a synthetic intent string)
    embeddings = self.embedder([doc.key for doc in docs])
    data = [
        {"vector": embedding, "payload": json.loads(doc.model_dump_json())}
        for embedding, doc in zip(embeddings, docs)
    ]
    try:
        # Append to the existing table if one was created on a previous run
        tbl = self.db.open_table(self.table_name)
        tbl.add(data)
    except FileNotFoundError:
        # First run: create the table from the initial batch
        tbl = self.db.create_table(self.table_name, data=data)

For searching, it embeds the query, performs the search, and returns properly typed results:

def search(self, query: str, limit: int = 10) -> Sequence[RetrievedEmbeddable[BaseModelT]]:
    embedding = self.embedder([query])[0]
    results = self.db[self.table_name].search(embedding).limit(limit).to_list()
    # ... converts results to RetrievedEmbeddable instances ...

LanceDB can persist the data to a file, ensuring subsequent runs don't need re-embedding if the file is saved. The codebase explicitly indicates that PydanticLancedbVectorDatabase is the primary implementation with:

implements(ObjectVectorDatabase)(PydanticLancedbVectorDatabase)

Structured Data Models

Pydantic's role is to structure the data that goes into these databases. The system uses several Pydantic models to provide structured representations:

  1. Embeddable[BaseModelT] - Wraps an item to be embedded with its key:

     class Embeddable(BaseModel, Generic[BaseModelT]):
         key: str
         payload: BaseModelT

  2. RetrievedEmbeddable[BaseModelT] - Represents a search result with score and payload:

     class RetrievedEmbeddable(BaseModel, Generic[BaseModelT]):
         score: float
         payload: BaseModelT = Field(default_factory=lambda: NullModel())
         score_type: SymbolRetrievalScoreType = SymbolRetrievalScoreType.SIMILARITY

This integration means results from database queries are automatically cast to the appropriate Pydantic model types, so you get structured Python objects directly rather than raw dictionaries.

Symbol Retrieval Implementation

MutaGReP provides two concrete symbol retrievers in the plan_search/symbol_retrievers/ directory:

  1. OpenAiVectorSearchSymbolRetriever - Uses OpenAI embeddings with vector search:

     def __call__(self, queries: Sequence[str], n_results: int = 5) -> Sequence[RetrievedSymbol]:
         retrieved_embeddables: list[RetrievedEmbeddable[Symbol]] = []
         for query in queries:
             retrieved_embeddables.extend(self.vector_database.search(query, n_results))
         # ... converts results to RetrievedSymbol instances ...

  2. Bm25SymbolRetriever - Uses the BM25 algorithm for keyword-based retrieval as an alternative:

     def __call__(self, queries: Sequence[str], n_results: int = 5) -> Sequence[RetrievedSymbol]:
         # Ignore any queries that are less than or equal to 2 characters
         queries = [query for query in queries if len(query) > 2]
         query_tokens = self.tokenizer.tokenize(list(queries))
         results, scores = self.text_retriever.retrieve(query_tokens, k=n_results)
         # ... processes results ...

Both implement the common SymbolRetriever protocol from domain_models.py, which specifies:

class SymbolRetriever(Protocol):
    def __call__(self, queries: Sequence[str], n_results: int = 5) -> Sequence[RetrievedSymbol]: ...
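Because both retrievers satisfy the same protocol, call sites are interchangeable. A hypothetical invocation, where the constructor argument and the result field names are assumptions:

retriever: SymbolRetriever = OpenAiVectorSearchSymbolRetriever(vector_database=database)
retrieved = retriever(queries=["I want to compute class weights for my labels"], n_results=5)
for rs in retrieved:
    print(rs.symbol.full_path, rs.score)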

Code Search Tools

The search tools in plan_search/code_search_tools/ leverage these symbol retrievers:

  1. DirectIntentSearchTool - Directly maps intents to symbols:

     def __call__(self, intention: str) -> CodeSearchToolOutput:
         retrieved_symbols = self.symbol_retriever(
             queries=[intention], n_results=self.symbols_to_retrieve
         )
         # ... creates output with retrieved symbols ...

  2. NoDuplicatesDirectIntentSearchTool - Adds deduplication logic (a possible dedup pass is sketched after this list):

     def __call__(self, intention: str) -> CodeSearchToolOutput:
         num_symbols_to_retrieve = self.symbols_to_retrieve * self.overretrieve_factor
         retrieved_symbols = self.symbol_retriever(
             queries=[intention], n_results=num_symbols_to_retrieve
         )
         # ... creates output and deduplicates ...

  3. OpenAiOneStepCodeSearchTool - A more sophisticated approach that uses an LLM to:
     - Generate keywords from the intent
     - Retrieve symbols using those keywords
     - Filter and rank the symbols
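The deduplication in the second tool can be as simple as over-retrieving and keeping only the first occurrence of each symbol; a sketch, with the result field names assumed:

def deduplicate(retrieved, keep: int):
    """Keep the first (highest-scoring) hit per fully qualified symbol path."""
    seen: set[str] = set()
    unique = []
    for rs in retrieved:
        if rs.symbol.full_path in seen:
            continue
        seen.add(rs.symbol.full_path)
        unique.append(rs)
        if len(unique) == keep:
            break
    return unique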

The inclusion of both Qdrant and LanceDB accommodates different usage scenarios. This dual integration ensures that, whether the context is an offline research experiment or an integrated development environment, the system can perform fast vector lookups for symbols. It also future-proofs the implementation: if a new vector database emerges, it could be integrated similarly through the common interfaces.

The important point is that MutaGReP's vector search component is not a black box: it is assembled from well-known tools (OpenAI embeddings, Qdrant, LanceDB) behind common interfaces.

In summary, MutaGReP's vector database integration provides a flexible, type-safe foundation for semantic code search. The system can store and retrieve symbol embeddings using either Qdrant (for production-grade scalability) or LanceDB (for simple local usage). The architecture follows a clean separation of concerns with protocols defining interfaces and concrete implementations providing specific behaviors. This design makes the system adaptable to different usage scenarios while maintaining consistent behavior across different vector database backends.