Harness Engineering for Legacy Migration (Part 2): Practical Implementation, Agent Design, and System Setup

By Sascha Turowski

In theory, harness engineering gives us a safe way to use AI for legacy migrations. In practice, the challenge is very concrete:

How do you design a system that can understand, analyze, and transform an entire legacy codebase—without losing control?

This post focuses on a practical setup, including a multi-agent pipeline, system design choices, and the real limitations you must account for.


Starting point: inventory before transformation

Before any migration begins, the system needs a structured understanding of both codebases:

  • the legacy system (what exists)
  • the target system (what should exist)

A useful approach is to introduce a scanning agent as the first stage in the harness.

This agent does not rewrite code. It builds an inventory.

From a technical perspective, it extracts:

  • modules, services, and dependencies
  • APIs and data flows
  • database interactions
  • external integrations

From a business perspective, it attempts to infer:

  • domain concepts (e.g., “orders”, “payments”, “users”)
  • critical workflows
  • implicit business rules embedded in code

A well-designed harness persists this as a structured knowledge base (graphs, schemas, or indexed documents), not just text summaries.
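To make "structured knowledge base" concrete, here is a minimal sketch of an inventory kept as typed records rather than free text. It assumes Python as the harness language; the `Component` and `Inventory` names, the fields, and the example components are hypothetical and only illustrate the idea. A real harness might back this with a graph database instead.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Component:
    """One inventoried unit of the legacy system (hypothetical schema)."""
    name: str
    kind: str  # e.g. "module", "service", "table"
    depends_on: list[str] = field(default_factory=list)
    domain_concepts: list[str] = field(default_factory=list)


@dataclass
class Inventory:
    components: dict[str, Component] = field(default_factory=dict)

    def add(self, component: Component) -> None:
        self.components[component.name] = component

    def dependents_of(self, name: str) -> list[str]:
        """Return the components that directly depend on `name`."""
        return sorted(
            c.name for c in self.components.values() if name in c.depends_on
        )

    def to_json(self) -> str:
        """Serialize so later agents can load the inventory as structured input."""
        return json.dumps(
            {n: asdict(c) for n, c in self.components.items()}, indent=2
        )


inventory = Inventory()
inventory.add(Component("orders_db", "table"))
inventory.add(Component("order_service", "service",
                        depends_on=["orders_db"], domain_concepts=["orders"]))
inventory.add(Component("billing", "module",
                        depends_on=["order_service", "orders_db"],
                        domain_concepts=["payments"]))
```

Because the inventory is data, a question like "what breaks if we migrate `orders_db` first?" becomes a query (`inventory.dependents_of("orders_db")`) instead of another prompt.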


Identifying mistakes and logical problems

A second responsibility of the scanning phase is diagnostic analysis.

A dedicated analysis agent (or pass) can flag:

  • dead code
  • duplicated logic
  • inconsistent naming or models
  • unreachable branches
  • error-prone patterns
  • violations of modern best practices

More interestingly, it can surface logical inconsistencies, such as:

  • conflicting business rules across modules
  • mismatched assumptions between services
  • edge cases handled differently in different places

This is one of the hidden advantages of AI-assisted migration:
you are not just moving the system—you are auditing it.
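
One way to surface conflicting business rules is to normalize what the inventory found into small records and compare them across modules. The sketch below assumes that an earlier stage has already extracted such records; `RuleRecord`, `find_conflicts`, and the sample rules are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleRecord:
    module: str  # where the rule was found
    rule: str    # normalized rule name (assumed to come from the inventory)
    value: str   # the value or condition the module enforces


def find_conflicts(records: list[RuleRecord]) -> dict[str, dict[str, list[str]]]:
    """Group records by rule and report rules with more than one distinct value."""
    by_rule: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for rec in records:
        by_rule[rec.rule][rec.value].append(rec.module)
    return {
        rule: {value: sorted(mods) for value, mods in values.items()}
        for rule, values in by_rule.items()
        if len(values) > 1
    }


records = [
    RuleRecord("checkout", "max_order_items", "50"),
    RuleRecord("cart_api", "max_order_items", "100"),
    RuleRecord("checkout", "currency_rounding", "half_up"),
    RuleRecord("invoicing", "currency_rounding", "half_up"),
]
conflicts = find_conflicts(records)
```

Here only `max_order_items` is reported, because two modules enforce different limits. The hard part in practice is the normalization step that produces comparable records, which this sketch takes as given.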


Why you should not scan everything at once

It is tempting to give a large model the entire codebase and ask:

“Explain this system and rewrite it.”

This approach still fails in practice, for several reasons.

1. Context window limits still matter

Even with modern large context windows, full codebases often exceed limits. More importantly, even when they fit, effective attention degrades.

Models do not treat all tokens equally. As input grows:

  • important details get diluted
  • long-range dependencies are missed
  • reasoning becomes less reliable

2. “Slope” effects in long contexts

As context size increases, performance does not degrade evenly across the input. How well a detail is used tends to depend on where it sits:

  • early parts of the input dominate
  • middle sections are partially ignored
  • later sections may be over-weighted or inconsistently used

This creates subtle, hard-to-detect errors.


3. Hallucination risk increases with scope

When asked to reason over very large, complex systems:

  • the model fills gaps with plausible assumptions
  • it may “connect” components that are unrelated
  • it may invent abstractions that do not exist

Newer models reduce this risk, but they do not eliminate it.


Are bigger models solving this?

Larger and more advanced models have:

  • bigger context windows
  • better retrieval and reasoning
  • improved consistency

However, the fundamental constraints remain:

  • Attention is still finite
  • Noise increases with scale
  • Precision drops when tasks are underspecified

In practice, the best results still come from:

structured decomposition + retrieval + iterative reasoning

Not from single-pass “understand everything” prompts.
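
As an illustration of the retrieval part, the following sketch selects only the chunks relevant to a question before anything is sent to a model. It uses plain token overlap as a stand-in for a real retriever (embeddings or a code index); the `Chunk` type, the `retrieve` function, and the sample files are assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; snake_case names split on underscores."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[Chunk], top_k: int = 2) -> list[Chunk]:
    """Rank chunks by how many query tokens they share and keep the best top_k."""
    q = tokenize(query)
    scored = [(len(q & tokenize(c.text)), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s > 0]  # drop chunks with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


chunks = [
    Chunk("billing/discount.py",
          "def apply_discount(order, rate): return order.total * (1 - rate)"),
    Chunk("auth/login.py", "def login(user, password): ..."),
    Chunk("orders/total.py",
          "def order_total(order): return sum(i.price for i in order.items)"),
]
hits = retrieve("how is the order discount applied", chunks)
```

The unrelated `auth/login.py` chunk never reaches the model, which is the point: the prompt carries the few pieces that matter for the current question rather than the whole repository.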


A multi-agent harness pipeline

To address these constraints, a practical system uses multiple specialized agents, each operating within controlled scope.

This is not just architectural preference—it is a reliability requirement.

A typical pipeline might look like this:


1. Inventory Agent

  • scans code in chunks
  • builds structured representations
  • extracts entities, relationships, and flows

Output:

  • system map
  • dependency graph
  • domain model candidates
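
Scanning in chunks can start as simply as splitting files on line boundaries under a size budget. The sketch below is one such splitter; the `chunk_lines` name and the character budget are assumptions, and a production harness would more likely split on function or class boundaries and budget in tokens.

```python
def chunk_lines(source: str, max_chars: int = 2000) -> list[str]:
    """Split source into chunks on line boundaries.

    Each chunk holds at most max_chars characters, except that a single
    line longer than max_chars becomes a chunk of its own.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in source.splitlines(keepends=True):
        # Flush the current chunk before it would exceed the budget.
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


source = "".join(f"line {i}\n" for i in range(10))
chunks = chunk_lines(source, max_chars=20)
```

Joining the chunks reproduces the original file exactly, so nothing is lost or duplicated between what the agent scanned and what is in the repository.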

2. Analysis Agent

  • reviews inventory data
  • identifies problems and inconsistencies
  • flags technical and logical issues

Output:

  • diagnostic report
  • risk areas for migration

3. Mapping Agent

  • aligns legacy components with target architecture
  • defines how modules should be transformed

Output:

  • migration plan per module/service
  • mapping between old and new structures

4. Transformation Agent

  • rewrites code incrementally
  • translates patterns, languages, and frameworks

Output:

  • candidate implementations

5. Verification Agent

  • runs tests and behavioral comparisons
  • checks correctness against legacy behavior

Output:

  • pass/fail signals
  • structured feedback
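
A behavioral comparison can be sketched as a differential test: run the legacy routine and the candidate on the same inputs and report every difference in a structured form. The function names and the shipping example below are hypothetical, and the sketch assumes both versions can be called in-process, which is often not the case for real legacy systems.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Mismatch:
    args: tuple
    legacy: Any
    candidate: Any


def outcome(fn: Callable[..., Any], args: tuple) -> tuple:
    """Run fn and capture either its return value or the exception type."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:  # raised errors are part of observable behavior
        return ("error", type(exc).__name__)


def compare_behavior(legacy: Callable[..., Any], candidate: Callable[..., Any],
                     cases: list[tuple]) -> list[Mismatch]:
    """Return structured feedback for every input where behavior differs."""
    mismatches = []
    for args in cases:
        old, new = outcome(legacy, args), outcome(candidate, args)
        if old != new:
            mismatches.append(Mismatch(args, old, new))
    return mismatches


# Hypothetical legacy routine and a rewritten candidate with a boundary bug.
def legacy_shipping(weight_kg: int) -> int:
    return 5 if weight_kg <= 10 else weight_kg * 2


def candidate_shipping(weight_kg: int) -> int:
    return 5 if weight_kg < 10 else weight_kg * 2


feedback = compare_behavior(legacy_shipping, candidate_shipping,
                            [(1,), (10,), (11,), (25,)])
```

Only the boundary input `10` is reported, along with both results. That record is the kind of structured feedback the transformation agent can act on in its next attempt.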

6. Integration Agent

  • prepares validated code for merge
  • ensures compatibility with the new system

Output:

  • deployable artifacts or pull requests
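
The six stages above can be wired together with a very small orchestrator. In this sketch each stage is a plain function that reads the shared state and returns its structured output; the stage bodies are stubs standing in for model-backed agents, and the `run_pipeline` helper and the `halt` convention are assumptions of this example, not a prescribed design.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def run_pipeline(stages: list[tuple[str, Stage]], state: dict) -> dict:
    """Run stages in order, merging each stage's output into the shared state.

    Stops early if a stage sets state["halt"] to a reason string.
    """
    state = {**state, "trace": []}
    for name, stage in stages:
        state = {**state, **stage(state)}
        state["trace"].append(name)
        if state.get("halt"):
            break
    return state


# Stub stages; a real harness would call an agent inside each one.
def inventory(state: dict) -> dict:
    return {"components": ["billing", "orders"]}

def analysis(state: dict) -> dict:
    return {"risks": [c for c in state["components"] if c == "billing"]}

def mapping(state: dict) -> dict:
    return {"plan": {c: f"new_{c}" for c in state["components"]}}

def transformation(state: dict) -> dict:
    return {"candidates": list(state["plan"].values())}

def verification(state: dict) -> dict:
    # Simulated result: candidates for risky components fail verification.
    failed = [state["plan"][c] for c in state["risks"]]
    return {"halt": "verification failed: " + ", ".join(failed)} if failed else {}

def integration(state: dict) -> dict:
    return {"merged": state["candidates"]}


result = run_pipeline(
    [("inventory", inventory), ("analysis", analysis), ("mapping", mapping),
     ("transformation", transformation), ("verification", verification),
     ("integration", integration)],
    {"repo": "legacy-repo"},
)
```

In this run the pipeline stops at verification, so integration never executes and `result["trace"]` shows exactly how far the work got.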

Why multiple agents matter

Splitting responsibilities across agents provides:

Controlled context
Each agent sees only what it needs, which keeps irrelevant code out of its context and improves accuracy.

Clear handoffs
Outputs become structured inputs for the next step.

Debuggability
You can identify where failures occur:

  • misunderstanding (inventory)
  • bad reasoning (analysis)
  • incorrect rewrite (transformation)

Iterative improvement
You can refine prompts and behavior per agent, instead of globally.


Human in the loop: where it fits

A fully autonomous system is not the goal—at least not today.

Human involvement is most valuable at:

  • Inventory validation
    Confirm that the system map reflects reality
  • Business logic review
    Validate inferred rules and workflows
  • Risk decisions
    Approve transformations in sensitive areas
  • Final integration
    Decide what gets merged or deployed

The harness should make human input:

  • targeted (only where needed)
  • efficient (review structured outputs, not raw code dumps)
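
A simple way to keep human input targeted is a routing rule that sends only risky or rule-changing items to a reviewer, as a short summary. The sketch below assumes the analysis stage assigns each change a risk score between 0 and 1; the `Change` type, the threshold, and the example changes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Change:
    component: str
    risk: float  # 0.0 to 1.0, assumed to come from the analysis stage
    touches_business_rule: bool


def needs_human_review(change: Change, risk_threshold: float = 0.6) -> bool:
    """Route to a human only when the change is risky or alters inferred rules."""
    return change.touches_business_rule or change.risk >= risk_threshold


def review_summary(change: Change) -> str:
    """A compact, structured item for the reviewer instead of a raw code dump."""
    reason = "business rule" if change.touches_business_rule else "high risk"
    return f"{change.component}: review needed ({reason}, risk={change.risk:.2f})"


changes = [
    Change("logging_util", 0.1, False),
    Change("payment_rounding", 0.4, True),
    Change("session_store", 0.8, False),
]
queue = [review_summary(c) for c in changes if needs_human_review(c)]
```

The low-risk `logging_util` change passes without review, while the other two land in the queue with the reason attached.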

Iterative learning and prompt evolution

One of the most practical advantages of this setup is that it improves over time.

You can:

  • refine prompts per agent
  • add rules based on past failures
  • incorporate domain-specific constraints
  • store successful patterns and reuse them

This turns the harness into a learning system, not just a pipeline.
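
Adding rules based on past failures can be as plain as keeping a per-agent rule list that is appended to the prompt. This is a minimal sketch; the `AgentPrompt` class and the example rules are hypothetical, and a real setup would persist and version the rules rather than hold them in memory.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPrompt:
    base: str
    rules: list[str] = field(default_factory=list)

    def learn(self, rule: str) -> None:
        """Record a rule derived from a past failure, skipping duplicates."""
        if rule not in self.rules:
            self.rules.append(rule)

    def render(self) -> str:
        """Build the prompt text: base instructions followed by learned rules."""
        if not self.rules:
            return self.base
        bullet_list = "\n".join(f"- {r}" for r in self.rules)
        return f"{self.base}\n\nRules learned from earlier runs:\n{bullet_list}"


transform_prompt = AgentPrompt("Rewrite the given module for the target framework.")
transform_prompt.learn("Preserve rounding behavior in monetary calculations.")
transform_prompt.learn("Preserve rounding behavior in monetary calculations.")
transform_prompt.learn("Do not rename public API fields.")
```

Because each agent has its own prompt object, a rule learned from a transformation failure does not leak into the inventory or analysis prompts.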


A minimal system setup

A practical implementation might include:

  • Code ingestion layer
    • repo access
    • file chunking and indexing
  • Knowledge store
    • vector database or structured graph
    • stores system understanding
  • Agent orchestration
    • workflow engine controlling agent sequence
  • Execution environment
    • sandbox for running code and tests
  • Evaluation layer
    • automated checks and comparisons
  • Observability
    • logs, traces, and metrics for every step
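
For the observability layer, a small decorator is enough to record status and duration for every step. The sketch below writes to an in-memory list as a stand-in for a real tracing backend; the `traced` decorator, the `TRACE` list, and the two example steps are assumptions made for illustration.

```python
import functools
import time

TRACE: list[dict] = []  # in-memory stand-in for a real tracing backend


def traced(step: str):
    """Decorator that records status and duration for each call of a step."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on success and on failure, so every call leaves a record.
                TRACE.append({
                    "step": step,
                    "status": status,
                    "seconds": time.perf_counter() - start,
                })
        return inner
    return wrap


@traced("inventory")
def scan(paths: list[str]) -> int:
    return len(paths)


@traced("verification")
def verify(ok: bool) -> None:
    if not ok:
        raise ValueError("behavior mismatch")


scan(["a.py", "b.py"])
try:
    verify(False)
except ValueError:
    pass
```

After these two calls, `TRACE` holds one record per step, including the failed verification, which is what makes it possible to see where in the pipeline a run went wrong.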

Final thought

It is natural to hope that better models will eventually remove the need for careful system design.

So far, the opposite has been true.

As models become more powerful, the importance of harness engineering increases, not decreases.

The winning approach is not:

  • “Give the model everything”

It is:

  • “Design a system where the model only needs to understand what matters—at the right time.”

That is what turns AI from a risky abstraction engine into a precise tool for evolving complex systems.
