The Problem
Genomic workflows are complex. Pipelines built around tools like STAR and BWA, each carrying layers of nested metadata, are hard for researchers to locate with conventional search. Scientists typically think in terms of research goals (“align RNA-seq from mouse samples”) rather than exact pipeline names or IDs.
Traditional filtering interfaces force users to know exactly what they’re looking for. But what if researchers could simply describe their goal in natural language and have an AI system find and execute the right pipeline?
At NesterLabs, we built exactly that: an AI-powered system enabling researchers to locate and execute genomic workflows through natural language rather than traditional filtering interfaces.
Technical Architecture
The solution combines three key technologies:
- OpenAI Agent SDK for prompt interpretation and reasoning
- Weaviate Cloud for vectorized pipeline storage and retrieval
- Python + FastAPI for pipeline execution APIs
Pipeline Storage Structure
Pipelines are stored as JSON documents with fields including ID, name, organism specifications, tool names, and workflow steps. Descriptions are vectorized using sentence embeddings for semantic matching.
{
  "id": "pipeline-001",
  "name": "RNA-Seq Alignment Pipeline",
  "organism": ["human", "mouse"],
  "tools": ["STAR", "samtools", "featureCounts"],
  "description": "Align RNA-seq reads to reference genome...",
  "workflow_steps": [
    "Quality control with FastQC",
    "Alignment with STAR",
    "BAM sorting and indexing",
    "Read counting with featureCounts"
  ]
}

Multi-Agent Design
The system employs specialized agents, each responsible for a specific domain of tasks. This modular approach allows for easy extension and maintenance.
Agent Roles
1. Orchestrator Agent
The brain of the system. Routes user requests to appropriate specialist agents based on intent classification. Maintains conversation context and ensures smooth handoffs between agents.
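To make the routing concrete, here is a minimal sketch of intent-based dispatch. The production system classifies intent with the OpenAI Agent SDK; the keyword rules and agent names below are purely illustrative stand-ins.

```python
# Map each intent to the specialist agent that handles it (names hypothetical).
AGENT_BY_INTENT = {
    "upload": "pipeline_upload_agent",
    "search": "pipeline_search_agent",
    "detail": "ask_pipeline_detail_agent",
    "run": "run_analysis_agent",
}

def classify_intent(query: str) -> str:
    """Crude keyword-based stand-in for LLM intent classification."""
    q = query.lower()
    if any(w in q for w in ("upload", "ingest", "add pipeline")):
        return "upload"
    if any(w in q for w in ("run", "execute", "start")):
        return "run"
    if any(w in q for w in ("detail", "parameter", "input", "output")):
        return "detail"
    return "search"  # default: most requests are discovery queries

def route(query: str) -> str:
    """Return the name of the specialist agent that should handle the query."""
    return AGENT_BY_INTENT[classify_intent(query)]
```

In the real system the orchestrator also passes along the shared conversation context with each handoff, so the specialist picks up where the user left off.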
2. Pipeline Upload Agent
Manages ingestion of new pipelines into the system. Handles validation, embedding generation, and storage in the vector database.
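The validation step can be sketched as a schema check against the fields shown in the storage structure above. This is a hypothetical helper, not the production validator:

```python
# Required fields mirror the pipeline JSON schema shown earlier in the post.
REQUIRED_FIELDS = {"id", "name", "organism", "tools", "description", "workflow_steps"}

def validate_pipeline(doc: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the doc is valid."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - doc.keys())]
    # organism, tools, and workflow_steps must all be lists.
    for list_field in ("organism", "tools", "workflow_steps"):
        if list_field in doc and not isinstance(doc[list_field], list):
            errors.append(f"{list_field} must be a list")
    return errors
```

Only documents that pass validation move on to embedding generation and storage in Weaviate.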
3. Pipeline Search Agent
Performs semantic and structured searches across the pipeline catalog. Translates natural language queries into vector similarity searches and metadata filters.
Example queries it handles:
- “Find pipelines for mouse RNA-seq analysis”
- “Show me workflows that use STAR aligner”
- “What pipelines support paired-end sequencing?”
4. Ask Pipeline Detail Agent
Retrieves detailed specifications for specific pipelines. Answers questions about parameters, requirements, expected inputs/outputs, and execution time estimates.
5. Run Analysis Agent
Executes selected pipelines with user-provided parameters. Handles job submission, progress monitoring, and result retrieval.
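A minimal sketch of the job-tracking state this agent needs, assuming a simple queued/running/completed lifecycle (class and field names are illustrative; the real system submits jobs through the FastAPI execution endpoints):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Job:
    pipeline_id: str
    params: dict
    status: str = "queued"  # queued -> running -> completed / failed
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)

class JobRegistry:
    """In-memory stand-in for the job store behind the execution API."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}

    def submit(self, pipeline_id: str, params: dict) -> str:
        job = Job(pipeline_id, params)
        self._jobs[job.job_id] = job
        return job.job_id

    def update(self, job_id: str, status: str) -> None:
        self._jobs[job_id].status = status

    def status(self, job_id: str) -> str:
        return self._jobs[job_id].status
```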
6. Synthesizer Agent (Planned)
Consolidates outputs from multiple agents into coherent responses. Useful for complex queries requiring information from multiple sources.
Shared Context
All agents share a centralized context containing:
- User queries and conversation history
- Extracted metadata (organism, tools, data types)
- Current execution status
- Retrieved pipeline information
This shared state enables seamless collaboration between agents without requiring users to repeat information.
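One way to model that shared state is a single dataclass passed between agents. The field names below are illustrative, not the production schema:

```python
from dataclasses import dataclass, field

@dataclass
class SharedContext:
    history: list[dict] = field(default_factory=list)      # user queries + agent replies
    metadata: dict = field(default_factory=dict)           # e.g. organism, tools, data type
    execution_status: dict = field(default_factory=dict)   # job_id -> status
    retrieved_pipelines: list[dict] = field(default_factory=list)

    def remember(self, role: str, content: str) -> None:
        """Append a turn to the conversation history."""
        self.history.append({"role": role, "content": content})
```

Because every agent reads and writes the same object, a follow-up like “run the second one” resolves against `retrieved_pipelines` without the user restating anything.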
Semantic Search in Action
When a user asks “I need to align RNA-seq data from mouse liver samples,” the system:
1. Intent Classification: Identifies this as a pipeline search request
2. Entity Extraction: Extracts “RNA-seq,” “mouse,” “liver” as key entities
3. Vector Search: Generates an embedding for the query and searches for similar pipeline descriptions
4. Metadata Filtering: Filters results by organism compatibility
5. Ranking: Ranks results by relevance and presents top matches
6. Response Generation: Formats results in natural language with actionable options
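The entity-extraction step above can be sketched as matching tokens against known vocabularies. In production this is done by the LLM; the vocabularies here are illustrative:

```python
# Hypothetical controlled vocabularies for illustration only.
ORGANISMS = {"human", "mouse", "rat", "zebrafish"}
ASSAYS = {"rna-seq", "atac-seq", "chip-seq", "wgs"}

def extract_entities(query: str) -> dict:
    """Pull organism and assay terms out of a free-text query."""
    tokens = set(query.lower().replace(",", " ").split())
    return {
        "organism": sorted(ORGANISMS & tokens),
        "assay": sorted(ASSAYS & tokens),
    }
```

The extracted entities then become the metadata filters applied in step 4, while the full query text drives the vector search in step 3.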
Key Insights
This architecture demonstrates several important principles for domain-specific AI systems:
- Semantic search beats keyword search for domain-specific discovery. Users don’t always know the exact terminology.
- Specialized agents outperform monolithic models for complex workflows. Each agent can be optimized for its specific task.
- Shared context is crucial for multi-turn conversations. Users shouldn’t have to repeat themselves.
- Domain expertise must be encoded in both the data structure and agent prompts.
Beyond Genomics
While built for genomic workflows, this architecture pattern is applicable to many domains:
- Documentation Search: Finding relevant docs across large technical catalogs
- DevOps Orchestration: Natural language interfaces for CI/CD pipelines
- Customer Support: Intelligent routing to specialized support agents
- Enterprise Search: Unified search across multiple internal systems
The key insight: semantic search paired with specialized reasoning agents creates effective interfaces for domain-specific data discovery.
Results
The system is now in production, helping researchers at life sciences organizations discover and execute genomic workflows through natural conversation. Key outcomes:
- Reduced pipeline discovery time from hours to minutes
- Enabled non-bioinformaticians to find and run appropriate workflows
- Improved pipeline utilization across the organization
- Created a foundation for adding new pipelines without UI changes