The Problem
Genomic workflows are complex. Pipelines built around tools like STAR and BWA, each carrying layers of nested metadata, are hard for researchers to locate with conventional search. Scientists typically think in terms of research goals (“align RNA-seq from mouse samples”) rather than exact pipeline names or IDs.
Traditional filtering interfaces force users to know exactly what they’re looking for. But what if researchers could simply describe their goal in natural language and have an AI system find and execute the right pipeline?
At NesterLabs, we built exactly that: an AI-powered system enabling researchers to locate and execute genomic workflows through natural language rather than traditional filtering interfaces.
Technical Architecture
The solution combines three key technologies:
- OpenAI Agent SDK for prompt interpretation and reasoning
- Weaviate Cloud for vectorized pipeline storage and retrieval
- Python + FastAPI for pipeline execution APIs
Pipeline Storage Structure
Pipelines are stored as JSON documents with fields including ID, name, organism specifications, tool names, and workflow steps. Descriptions are vectorized using sentence embeddings for semantic matching.
{
  "id": "pipeline-001",
  "name": "RNA-Seq Alignment Pipeline",
  "organism": ["human", "mouse"],
  "tools": ["STAR", "samtools", "featureCounts"],
  "description": "Align RNA-seq reads to reference genome...",
  "workflow_steps": [
    "Quality control with FastQC",
    "Alignment with STAR",
    "BAM sorting and indexing",
    "Read counting with featureCounts"
  ]
}

Multi-Agent Design
The system employs specialized agents, each responsible for a specific domain of tasks. This modular approach allows for easy extension and maintenance.
Agent Roles
1. Orchestrator Agent
The brain of the system. Routes user requests to appropriate specialist agents based on intent classification. Maintains conversation context and ensures smooth handoffs between agents.
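To make the routing concrete, here is a minimal sketch of intent-based dispatch. The production system classifies intent with the OpenAI Agent SDK; the keyword rules and agent names below are purely illustrative stand-ins.

```python
# Map each intent to the specialist agent that handles it (names hypothetical).
AGENT_BY_INTENT = {
    "upload": "pipeline_upload_agent",
    "search": "pipeline_search_agent",
    "detail": "ask_pipeline_detail_agent",
    "run": "run_analysis_agent",
}

def classify_intent(query: str) -> str:
    """Crude keyword-based stand-in for LLM intent classification."""
    q = query.lower()
    if any(w in q for w in ("upload", "ingest", "add pipeline")):
        return "upload"
    if any(w in q for w in ("run", "execute", "start")):
        return "run"
    if any(w in q for w in ("detail", "parameter", "input", "output")):
        return "detail"
    return "search"  # default: most requests are discovery queries

def route(query: str) -> str:
    """Return the name of the specialist agent that should handle the query."""
    return AGENT_BY_INTENT[classify_intent(query)]
```

In the real system the orchestrator also passes along the shared conversation context with each handoff, so the specialist picks up where the user left off.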
2. Pipeline Upload Agent
Manages ingestion of new pipelines into the system. Handles validation, embedding generation, and storage in the vector database.
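The validation step can be sketched as a schema check against the fields shown in the storage structure above. This is a hypothetical helper, not the production validator:

```python
# Required fields mirror the pipeline JSON schema shown earlier in the post.
REQUIRED_FIELDS = {"id", "name", "organism", "tools", "description", "workflow_steps"}

def validate_pipeline(doc: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the doc is valid."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - doc.keys())]
    # organism, tools, and workflow_steps must all be lists.
    for list_field in ("organism", "tools", "workflow_steps"):
        if list_field in doc and not isinstance(doc[list_field], list):
            errors.append(f"{list_field} must be a list")
    return errors
```

Only documents that pass validation move on to embedding generation and storage in Weaviate.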
3. Pipeline Search Agent
Performs semantic and structured searches across the pipeline catalog. Translates natural language queries into vector similarity searches and metadata filters.
Example queries it handles:
- “Find pipelines for mouse RNA-seq analysis”
- “Show me workflows that use STAR aligner”
- “What pipelines support paired-end sequencing?”
4. Ask Pipeline Detail Agent
Retrieves detailed specifications for specific pipelines. Answers questions about parameters, requirements, expected inputs/outputs, and execution time estimates.
5. Run Analysis Agent
Executes selected pipelines with user-provided parameters. Handles job submission, progress monitoring, and result retrieval.
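A minimal sketch of the job-tracking state this agent needs, assuming a simple queued/running/completed lifecycle (class and field names are illustrative; the real system submits jobs through the FastAPI execution endpoints):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Job:
    pipeline_id: str
    params: dict
    status: str = "queued"  # queued -> running -> completed / failed
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)

class JobRegistry:
    """In-memory stand-in for the job store behind the execution API."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def submit(self, pipeline_id: str, params: dict) -> str:
        job = Job(pipeline_id, params)
        self._jobs[job.job_id] = job
        return job.job_id

    def update(self, job_id: str, status: str) -> None:
        self._jobs[job_id].status = status

    def status(self, job_id: str) -> str:
        return self._jobs[job_id].status
```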
6. Synthesizer Agent (Planned)
Consolidates outputs from multiple agents into coherent responses. Useful for complex queries requiring information from multiple sources.
Shared Context
All agents share a centralized context containing:
- User queries and conversation history
- Extracted metadata (organism, tools, data types)
- Current execution status
- Retrieved pipeline information
This shared state enables seamless collaboration between agents without requiring users to repeat information.
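One way to model that shared state is a single dataclass passed between agents. The field names below are illustrative, not the production schema:

```python
from dataclasses import dataclass, field

@dataclass
class SharedContext:
    history: list[dict] = field(default_factory=list)      # user queries + agent replies
    metadata: dict = field(default_factory=dict)           # e.g. organism, tools, data type
    execution_status: dict = field(default_factory=dict)   # job_id -> status
    retrieved_pipelines: list[dict] = field(default_factory=list)

    def remember(self, role: str, content: str) -> None:
        """Append a turn to the conversation history."""
        self.history.append({"role": role, "content": content})
```

Because every agent reads and writes the same object, a follow-up like “run the second one” resolves against `retrieved_pipelines` without the user restating anything.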
Semantic Search in Action
When a user asks “I need to align RNA-seq data from mouse liver samples,” the system:
1. Intent Classification: Identifies this as a pipeline search request
2. Entity Extraction: Extracts “RNA-seq,” “mouse,” “liver” as key entities
3. Vector Search: Generates an embedding for the query and searches for similar pipeline descriptions
4. Metadata Filtering: Filters results by organism compatibility
5. Ranking: Ranks results by relevance and presents top matches
6. Response Generation: Formats results in natural language with actionable options
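The entity-extraction step above can be sketched as matching tokens against known vocabularies. In production this is done by the LLM; the vocabularies here are illustrative:

```python
# Hypothetical controlled vocabularies for illustration only.
ORGANISMS = {"human", "mouse", "rat", "zebrafish"}
ASSAYS = {"rna-seq", "atac-seq", "chip-seq", "wgs"}

def extract_entities(query: str) -> dict:
    """Pull organism and assay terms out of a free-text query."""
    tokens = set(query.lower().replace(",", " ").split())
    return {
        "organism": sorted(ORGANISMS & tokens),
        "assay": sorted(ASSAYS & tokens),
    }
```

The extracted entities then become the metadata filters applied in step 4, while the full query text drives the vector search in step 3.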
Key Insights
This architecture demonstrates several important principles for domain-specific AI systems:
- Semantic search beats keyword search for domain-specific discovery. Users don’t always know the exact terminology.
- Specialized agents outperform monolithic models for complex workflows. Each agent can be optimized for its specific task.
- Shared context is crucial for multi-turn conversations. Users shouldn’t have to repeat themselves.
- Domain expertise must be encoded in both the data structure and agent prompts.
Beyond Genomics
While built for genomic workflows, this architecture pattern is applicable to many domains:
- Documentation Search: Finding relevant docs across large technical catalogs
- DevOps Orchestration: Natural language interfaces for CI/CD pipelines
- Customer Support: Intelligent routing to specialized support agents
- Enterprise Search: Unified search across multiple internal systems
The key insight: semantic search paired with specialized reasoning agents creates effective interfaces for domain-specific data discovery.
Results
The system is now in production, helping researchers at life sciences organizations discover and execute genomic workflows through natural conversation. Key outcomes:
- Reduced pipeline discovery time from hours to minutes
- Enabled non-bioinformaticians to find and run appropriate workflows
- Improved pipeline utilization across the organization
- Created a foundation for adding new pipelines without UI changes