Introduction - Data Version Documentation

What is Data Version?

Data Version is a data lake management platform that brings Git-like version control to your data pipelines. Built on Apache Iceberg, it lets you branch, version, and roll back data with the same confidence you have with code.

The Problem We Solve

Modern data teams face a critical challenge: when mistakes happen in data pipelines, recovery is expensive and time-consuming. Traditional approaches can require two- to three-week backfill campaigns costing $50K-$200K per incident. Without a cheap way to undo mistakes, teams lack the confidence to experiment, which leads to brittle, untested pipelines.

Key Benefits:

Instant Rollback: Recover from pipeline failures in seconds, not weeks
Time Travel: Query historical data states without maintaining copies
Safe Experimentation: Branch and test schema changes with complete isolation
Zero Operations: Serverless architecture requires no infrastructure management

How It Works

Data Version transforms your SQL queries into versioned, scheduled ETL pipelines with a single click. The platform provides:

1. Query to Pipeline Transformation

Write a SQL query, click "Save as Table", and Data Version automatically creates a managed pipeline with scheduling, dependency tracking, and version control. Your query results become versioned Iceberg tables that you can branch, merge, and roll back.
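
As a rough illustration of what "Save as Table" might record, the sketch below turns a query into a small pipeline definition. The names PipelineDefinition and save_as_table, the field layout, and the regex-based dependency detection are hypothetical and are not Data Version's actual API. The schedule string uses the EventBridge rate(...) expression syntax.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Naive pattern: picks up the identifier that follows FROM or JOIN.
# It ignores CTEs and quoted names, and can be fooled by expressions such as
# EXTRACT(day FROM ts), so treat it as an illustration only.
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


@dataclass(frozen=True)
class PipelineDefinition:
    """Hypothetical record of what a saved query needs in order to run as a pipeline."""
    target_table: str
    sql: str
    schedule: str                # EventBridge schedule expression, e.g. "rate(1 day)"
    depends_on: Tuple[str, ...]  # upstream tables referenced by the query


def save_as_table(target_table: str, sql: str, schedule: str) -> PipelineDefinition:
    """Build a pipeline definition and infer upstream tables from the SQL text."""
    referenced = set(_TABLE_REF.findall(sql)) - {target_table}
    return PipelineDefinition(target_table, sql, schedule, tuple(sorted(referenced)))


pipeline = save_as_table(
    "daily_revenue",
    "SELECT o.day, SUM(o.amount) AS revenue "
    "FROM clean_orders o JOIN dim_customers c ON o.customer_id = c.id "
    "GROUP BY o.day",
    "rate(1 day)",
)
print(pipeline.depends_on)
```

Capturing dependencies when the query is saved is what would let later features, such as impact analysis, work from declared metadata rather than from logs.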

2. Git-Like Operations for Data

Every table version is immutable and addressable. You can:

Time travel to any historical snapshot
Create branches for testing schema changes
Merge tested changes back to production
Roll back bad deployments in seconds
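
The operations above can be pictured with a small in-memory model. This is a conceptual sketch only: the VersionedTable class and its methods are hypothetical, and real Iceberg tables track snapshots through metadata files rather than Python objects. The point is that snapshots are never modified, so branching, merging, and rolling back reduce to moving a pointer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable table version; rows are stored as a tuple so they cannot change."""
    snapshot_id: int
    parent_id: Optional[int]
    rows: Tuple[tuple, ...]


@dataclass
class VersionedTable:
    """Hypothetical model: snapshots are append-only, branches are named pointers."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    branches: Dict[str, int] = field(default_factory=dict)

    def commit(self, branch: str, rows: List[tuple]) -> int:
        """Write a new snapshot on top of the branch head and advance the branch."""
        snapshot_id = len(self.snapshots) + 1
        parent_id = self.branches.get(branch)
        self.snapshots[snapshot_id] = Snapshot(snapshot_id, parent_id, tuple(rows))
        self.branches[branch] = snapshot_id
        return snapshot_id

    def create_branch(self, name: str, from_branch: str) -> None:
        """A new branch starts at the same snapshot as its source; no data is copied."""
        self.branches[name] = self.branches[from_branch]

    def read(self, branch: str = "main", snapshot_id: Optional[int] = None) -> Tuple[tuple, ...]:
        """Read a branch head, or time travel by passing an explicit snapshot id."""
        target = snapshot_id if snapshot_id is not None else self.branches[branch]
        return self.snapshots[target].rows

    def rollback(self, branch: str, snapshot_id: int) -> None:
        """Point the branch back at an earlier snapshot; later snapshots are kept."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.branches[branch] = snapshot_id

    def _is_ancestor(self, ancestor_id: int, descendant_id: int) -> bool:
        """Walk parent links from descendant_id looking for ancestor_id."""
        current: Optional[int] = descendant_id
        while current is not None:
            if current == ancestor_id:
                return True
            current = self.snapshots[current].parent_id
        return False

    def fast_forward(self, target: str, source: str) -> None:
        """Merge by moving target to source's head, allowed only if target has not diverged."""
        if not self._is_ancestor(self.branches[target], self.branches[source]):
            raise ValueError("branches have diverged; fast-forward is not possible")
        self.branches[target] = self.branches[source]


table = VersionedTable()
v1 = table.commit("main", [(1, "alice")])
table.create_branch("test-schema", "main")
v2 = table.commit("test-schema", [(1, "alice", "US")])  # schema change, isolated from main
table.fast_forward("main", "test-schema")               # promote the tested change
table.rollback("main", v1)                              # undo it by moving the pointer back
```

Because the rollback only moves a pointer, the snapshot written on the branch is still readable afterwards with read(snapshot_id=v2).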

3. Native Lineage and Dependencies

Lineage is captured at authoring time, not reverse-engineered from logs. This enables:

Automatic impact analysis when schemas change
Intelligent query generation with full context awareness
Cascading dependency management across your entire pipeline
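
To make impact analysis concrete, here is a minimal sketch that walks a lineage graph to find every table downstream of a change. The depends_on mapping, the table names, and the impacted_tables function are assumptions for illustration; the platform's real lineage store and API may look quite different.

```python
from collections import deque
from typing import Dict, List, Set


def build_downstream_index(depends_on: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert 'table -> upstream tables' into 'table -> direct downstream tables'."""
    downstream: Dict[str, Set[str]] = {}
    for table, upstreams in depends_on.items():
        downstream.setdefault(table, set())
        for upstream in upstreams:
            downstream.setdefault(upstream, set()).add(table)
    return downstream


def impacted_tables(depends_on: Dict[str, List[str]], changed: str) -> List[str]:
    """Breadth-first walk returning every table that transitively reads from `changed`."""
    downstream = build_downstream_index(depends_on)
    seen = {changed}
    order: List[str] = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        # Sort so the result is deterministic regardless of set ordering.
        for child in sorted(downstream.get(current, ())):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage, recorded when each pipeline was authored.
lineage = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": ["raw_orders"],
    "daily_revenue": ["clean_orders"],
    "customer_ltv": ["clean_orders", "raw_customers"],
}
print(impacted_tables(lineage, "raw_orders"))
```

A schema change to raw_orders would flag clean_orders first, then the two tables built on top of it, while raw_customers is left alone.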

4. AI-Powered Query Generation

Ask business questions in natural language and get production-ready SQL queries. The AI understands your schema, lineage, and query patterns, generating queries that integrate seamlessly with your existing pipelines.

Architecture

Data Version consists of three integrated components:

Desktop Application

An Electron-based desktop client that provides a deployment wizard and a native desktop experience. The wizard guides you step by step through deploying the backend to your AWS account.

React Web Interface

Modern web application for browsing your data catalog, writing queries, and managing versions. Features include:

Interactive SQL editor with AI assistance
Visual query builder
Data catalog browser with schema exploration
Pipeline scheduling and dependency management
Version control interface for branches and snapshots

Serverless Backend

AWS CDK-based infrastructure deployed to your AWS account:

Lambda Functions: Query execution, pipeline orchestration, version management
DynamoDB: Metadata storage and catalog management
S3: Iceberg table storage with versioning
Athena: SQL query engine
EMR Serverless: Python pipeline execution
EventBridge: Scheduled pipeline triggers
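
Because the tables are Iceberg tables queried through Athena, a time travel read can be expressed with Athena's FOR TIMESTAMP AS OF and FOR VERSION AS OF clauses. The helper below is a sketch that only builds the SQL string. The function name, the database and table names, and the idea that you would query the deployed tables directly (rather than through the Data Version interface) are assumptions.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Conservative identifier check so that names can be interpolated safely.
_IDENTIFIER = re.compile(r"[a-z][a-z0-9_]*")


def time_travel_query(
    database: str,
    table: str,
    as_of: Optional[datetime] = None,
    snapshot_id: Optional[int] = None,
) -> str:
    """Build an Athena SELECT against an Iceberg table at a past timestamp or snapshot."""
    if (as_of is None) == (snapshot_id is None):
        raise ValueError("pass exactly one of as_of or snapshot_id")
    for name in (database, table):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    source = f"{database}.{table}"
    if snapshot_id is not None:
        return f"SELECT * FROM {source} FOR VERSION AS OF {int(snapshot_id)}"
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {source} FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'"


# Hypothetical database and table names.
print(time_travel_query("analytics", "orders",
                        as_of=datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)))
```

The resulting string could then be submitted like any other Athena query, for example through the start_query_execution call of the boto3 Athena client.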

Use Cases

Pipeline Development and Testing

Create a branch of your production table, test schema changes or new transformations, then merge back when validated. No need to maintain separate dev/staging environments.

Disaster Recovery

When a bad deployment corrupts data, roll back to the last good snapshot in seconds. No manual intervention, no tribal knowledge required.

Data Quality Monitoring

Track data quality metrics across versions. When quality degrades, identify which pipeline change caused the issue and roll back if needed.

Cross-Team Collaboration

Spin up isolated data lake instances per team, share tables across departments, and manage everything centrally. Each team keeps its own workspace without losing access to shared data.

Technology Stack

Apache Iceberg: Open table format enabling time travel and versioning
AWS Serverless: Lambda, DynamoDB, S3, Athena, EMR Serverless, EventBridge
React + Electron: Modern, responsive user interface
CDK (Python): Infrastructure as code for reproducible deployments

Next Steps

Ready to get started? Head over to the Getting Started Guide to install the desktop client and deploy to your AWS account.
