Introduction to Data Version

Version Control for Data Lakes

What is Data Version?

Data Version is a data lake management platform that brings Git-like version control to your data pipelines. Built on Apache Iceberg, it lets you branch, version, and roll back data with the same confidence you have with code.

The Problem We Solve

Modern data teams face a critical challenge: when mistakes happen in data pipelines, recovery is expensive and time-consuming. Traditional approaches require 2-3 week backfill campaigns costing $50K-$200K per incident. As a result, teams lack the confidence to experiment, which leads to brittle, untested pipelines.

Key Benefits:
  • Instant Rollback: Recover from pipeline failures in seconds, not weeks
  • Time Travel: Query historical data states without maintaining copies
  • Safe Experimentation: Branch and test schema changes with complete isolation
  • Zero Operations: Serverless architecture requires no infrastructure management

How It Works

Data Version transforms your SQL queries into versioned, scheduled ETL pipelines with a single click. The platform provides:

1. Query to Pipeline Transformation

Write a SQL query, click "Save as Table", and Data Version automatically creates a managed pipeline with scheduling, dependency tracking, and version control. Your query results become versioned Iceberg tables that you can branch, merge, and roll back.

2. Git-Like Operations for Data

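Every table version is immutable and addressable, so you can branch a table, merge changes back, roll back to an earlier version, and query historical states.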
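
As a rough illustration of what "immutable and addressable" means, the following Python sketch models a table as an append-only log of snapshots plus named branch pointers. It is a toy in-memory model, not Data Version's or Apache Iceberg's actual API; the names VersionedTable, commit, read, and rollback are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version (hypothetical model)."""
    snapshot_id: int
    parent_id: Optional[int]
    rows: tuple  # a tuple, so rows cannot be added or removed in place


@dataclass
class VersionedTable:
    """Toy table: an append-only snapshot log plus branch pointers."""
    snapshots: dict = field(default_factory=dict)
    branches: dict = field(default_factory=dict)

    def commit(self, branch: str, rows: list) -> int:
        """Write a new snapshot and advance the branch pointer to it."""
        new_id = len(self.snapshots) + 1
        parent = self.branches.get(branch)
        self.snapshots[new_id] = Snapshot(new_id, parent, tuple(rows))
        self.branches[branch] = new_id
        return new_id

    def read(self, branch: str = "main", snapshot_id: Optional[int] = None) -> list:
        """Read a branch head, or any past snapshot by ID (time travel)."""
        target = snapshot_id if snapshot_id is not None else self.branches[branch]
        return list(self.snapshots[target].rows)

    def rollback(self, branch: str, snapshot_id: int) -> None:
        """Move the branch pointer back; no data is copied or deleted."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.branches[branch] = snapshot_id


table = VersionedTable()
good = table.commit("main", [{"id": 1, "amount": 10}])
bad = table.commit("main", [{"id": 1, "amount": -999}])
table.rollback("main", good)  # main now points at the good snapshot again
```

Because a rollback only moves a pointer, the bad snapshot remains readable by ID for later investigation, which is why recovery in this model does not require a backfill.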

3. Native Lineage and Dependencies

Lineage is captured at authoring time, not reverse-engineered from logs, so the dependencies between tables are known as soon as a query is saved.
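
One way to picture authoring-time lineage: when a query is saved, the tables it reads are recorded immediately, and the resulting dependency graph gives a safe run order. The sketch below is an assumption about how this could work, not Data Version's implementation. The regex is deliberately naive (a real system would use a SQL parser), and the table names are made up.

```python
import re
from graphlib import TopologicalSorter

# Naive pattern: the identifier following FROM or JOIN. Illustrative only;
# it ignores CTEs, subqueries, and quoted identifiers.
_SOURCE_TABLE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def upstream_tables(sql: str) -> set:
    """Return the names of the tables a query reads from."""
    return set(_SOURCE_TABLE.findall(sql))


# Hypothetical saved queries, keyed by the table each one produces.
saved_queries = {
    "daily_revenue": "SELECT day, SUM(amount) FROM clean_orders GROUP BY day",
    "clean_orders": "SELECT o.* FROM raw_orders o JOIN customers c ON o.cid = c.id",
}

# Lineage is recorded when the query is saved, before anything runs.
lineage = {table: upstream_tables(sql) for table, sql in saved_queries.items()}

# A topological sort yields an order in which every table is built after
# the tables it depends on.
run_order = list(TopologicalSorter(lineage).static_order())
```

With these inputs, clean_orders is scheduled after raw_orders and customers, and daily_revenue after clean_orders.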

4. AI-Powered Query Generation

Ask business questions in natural language and get production-ready SQL queries. The AI understands your schema, lineage, and query patterns, generating queries that integrate seamlessly with your existing pipelines.

Architecture

Data Version consists of three integrated components:

Desktop Application

Electron-based desktop client that provides a deployment wizard and native desktop experience. The desktop app guides you through AWS deployment with a simple, intuitive interface.

React Web Interface

Modern web application for browsing your data catalog, writing queries, and managing versions.

Serverless Backend

AWS CDK-based infrastructure deployed to your AWS account.

Use Cases

Pipeline Development and Testing

Create a branch of your production table, test schema changes or new transformations, then merge back when validated. No need to maintain separate dev/staging environments.
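
The branch, test, merge flow can be sketched as pointer operations over snapshots. This is a simplified model under stated assumptions: branches are just names pointing at snapshot IDs, a merge is a fast-forward only, and create_branch, commit, validate, and merge are hypothetical helpers rather than Data Version commands.

```python
# Toy state: snapshot ID -> rows, and branch name -> snapshot ID.
snapshots = {1: [{"id": 1, "email": "a@example.com"}]}
branches = {"main": 1}


def create_branch(name: str, source: str = "main") -> None:
    """A new branch starts at the source's snapshot; no rows are copied."""
    branches[name] = branches[source]


def commit(branch: str, rows: list) -> int:
    """Store rows as a new snapshot and point the branch at it."""
    snapshot_id = max(snapshots) + 1
    snapshots[snapshot_id] = rows
    branches[branch] = snapshot_id
    return snapshot_id


def validate(rows: list, required: set) -> bool:
    """Example check: every row carries all required columns."""
    return all(required.issubset(row) for row in rows)


def merge(source: str, target: str, required: set) -> bool:
    """Fast-forward the target to the source only if validation passes."""
    if not validate(snapshots[branches[source]], required):
        return False
    branches[target] = branches[source]
    return True


create_branch("add-country")
commit("add-country", [{"id": 1, "email": "a@example.com", "country": "DE"}])
main_before_merge = branches["main"]  # unchanged: the branch is isolated
merged = merge("add-country", "main", required={"id", "email", "country"})
```

Until the merge succeeds, readers of main keep seeing the original snapshot, which is what makes a separate staging copy unnecessary in this model.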

Disaster Recovery

When a bad deployment corrupts data, roll back to the last good snapshot in seconds. No manual intervention, no tribal knowledge required.

Data Quality Monitoring

Track data quality metrics across versions. When quality degrades, identify which pipeline change caused the issue and roll back if needed.
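
To make this concrete, here is a minimal sketch of locating the version where a quality metric first degraded. The metric (the null rate of one column), the threshold, and the version history are all made-up illustrations, not output from Data Version.

```python
def null_rate(rows: list, column: str) -> float:
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(row.get(column) is None for row in rows) / len(rows)


def first_degraded_version(history: list, column: str, threshold: float):
    """Return the ID of the earliest version whose null rate exceeds the threshold.

    `history` is a list of (version_id, rows) pairs, oldest first.
    Returns None if no version breaches the threshold.
    """
    for version_id, rows in history:
        if null_rate(rows, column) > threshold:
            return version_id
    return None


# Hypothetical history of one table, oldest version first.
history = [
    (101, [{"email": "a@example.com"}, {"email": "b@example.com"}]),
    (102, [{"email": "a@example.com"}, {"email": None}]),
    (103, [{"email": None}, {"email": None}]),
]

culprit = first_degraded_version(history, "email", threshold=0.25)
# The newest version older than the culprit is the rollback candidate.
last_good = max(v for v, _ in history if v < culprit)
```

Here culprit identifies the version written by the pipeline change worth inspecting, and last_good is the natural rollback target.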

Cross-Team Collaboration

Spin up isolated data lake instances per team, share tables across departments, and manage everything centrally. Each team keeps its organizational flexibility while data stays accessible across the organization.

Technology Stack

Next Steps

Ready to get started? Head over to the Getting Started Guide to install the desktop client and deploy to your AWS account.
