Why Your ETL Can't Just Read My Database
Every time I start a new project with a data warehouse integration, we have the same conversation. The ETL engineers want direct read access to my application's database. I say no. They push back. I hold firm. Then we spend the next week having increasingly tense meetings where everyone thinks everyone else is being unreasonable.
I've been on both sides of this conversation enough times to know that neither side is wrong. We're just optimizing for different things. But after watching this pattern play out across multiple projects, I've come to believe that the contract-based approach isn't just an architectural preference. It's the only approach that actually protects both teams from pain down the road.
Here's what I've learned about why this tension exists and how to navigate it.
The Rule: Communication Through Contracts
In any distributed system I design, I have one non-negotiable rule: communication across bounded contexts (separate systems or teams that own their own data) must happen through well-defined contracts. That means OpenAPI-defined REST APIs, message contracts, or event streams. Not direct database access. Not shared tables. Not "just give me read-only credentials."
This isn't my rule. Eric Evans built the concept of Bounded Contexts into Domain-Driven Design, and every serious systems thinker since has reinforced it. Fowler's been documenting the problems with shared databases for years. Newman's entire book on microservices keeps coming back to the same idea: services shouldn't share databases. This isn't novel. It's industry consensus.
The rule isn't about control or gatekeeping. It's about managing coupling.
When you let external systems reach directly into your database, you create invisible dependencies. An engineer makes a schema change: renaming a column, changing a data type, refactoring table relationships. Something breaks. Not in their bounded context. Somewhere else. In a system they didn't even know was reading that table.
Now you're spending days tracking down what changed and who's affected. The fix usually involves picking the least terrible compromise because you're dealing with a coupling problem that spans team boundaries, deployment schedules, and organizational politics. I've been through this scenario enough times to recognize it immediately, and it's never quick and it's never clean.
Contracts prevent this. As long as I fulfill the contract, I can refactor my internals without breaking dependencies. If I need to change the contract, I can version it and give everyone time to migrate. That's vastly better than trying to coordinate simultaneous changes across multiple systems and teams.
Why ETL Engineers Resist This
Here's where I need to be honest: ETL engineers have legitimate reasons for wanting direct database access. I've talked to enough of them to understand the frustration.
Performance matters. Reading from a database with a well-tuned query is often faster than calling an API that wasn't designed for bulk data extraction. ETL jobs process millions of rows. An API that returns 100 records per page with separate HTTP requests for each page? That's a non-starter: extracting a million rows means 10,000 round trips, and the job that should take 30 minutes now takes 6 hours.
They're not changing anything. From the ETL perspective, they're read-only consumers. They're not writing data back, so how can they cause problems? The risk feels theoretical, especially when they're under pressure to deliver analytics dashboards that the business needed yesterday.
The database is the source of truth. If the application stores data in PostgreSQL, and the ETL is trying to replicate that data into a data warehouse, why add a layer of indirection? Just read the source of truth directly. It seems simpler, and in software, simpler is usually better.
APIs don't always expose what they need. The application's existing REST API wasn't designed for bulk extraction. That's a different problem requiring different endpoints optimized for ETL. But the solution isn't "no API." It's "APIs designed for ETL performance, not just UI."
I get all of this. These aren't bad arguments. They're practical concerns from people trying to do their jobs efficiently.
But they're still wrong.
The Problem With Direct Database Access
The coupling problem applies to ETL just as much as any other external dependency. Maybe more, because ETL jobs run in the background and break silently.
When your ETL reads directly from my database, I can't refactor my schema without coordinating with your team. That ORDER_DETAILS table you're reading? I wanted to rename it to ORDER_LINE_ITEMS because our domain language changed. That CUSTOMER_STATUS column you're filtering on? I wanted to replace it with a state machine pattern using separate tables. That JSON column you're parsing? I wanted to normalize it into relational tables.
But I can't do any of that without breaking your ETL job. And because your job runs at 2 AM and reports failures to a different on-call rotation, I won't even know it's broken until someone notices the dashboard data is stale three days later.
Here's a concrete example: an ETL job queries SELECT order_id, total_amount FROM ORDER_DETAILS. The moment I rename total_amount to total_cost, the query fails because the column no longer exists, and the silent 2 AM failure plays out exactly as above.
This is the same coupling problem that makes monolithic applications hard to change. The difference is that in a monolith, at least the coupling is visible in the codebase. With direct database access, the coupling is invisible. You don't discover it until something breaks.
The Business Logic Problem
There's a second problem that's often worse than the coupling issue: when you read directly from the database, you bypass all the business logic in the application layer.
Let's say my application calculates an order total. It takes the line item prices, applies any discounts, adds shipping costs, calculates tax based on the customer's location, and applies any promotional credits. That logic lives in the application. It's tested, it's versioned, it's maintained.
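To make that concrete, here's a minimal sketch of the shape of that logic. Every type, rate, and rule below is hypothetical, but this is the kind of code that lives in the application layer and never appears in any table:

using System.Collections.Generic;
using System.Linq;

public record OrderLine(decimal UnitPrice, int Quantity);
public record Order(IReadOnlyList<OrderLine> Lines, decimal ShippingCost, decimal PromotionalCredits);
public record Customer(string Location, decimal DiscountRate);

public static class Pricing
{
    // Hypothetical rules: each one can (and will) change without any schema change.
    public static decimal CalculateOrderTotal(Order order, Customer customer)
    {
        var subtotal = order.Lines.Sum(l => l.UnitPrice * l.Quantity);
        var discounted = subtotal * (1 - customer.DiscountRate);   // discount rules
        var tax = discounted * TaxRateFor(customer.Location);      // location-based tax
        return discounted + order.ShippingCost + tax - order.PromotionalCredits;
    }

    private static decimal TaxRateFor(string location) =>
        location == "CA" ? 0.0725m : 0.05m; // stand-in for real tax logic
}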
If your ETL reads directly from the ORDER_LINES table, you don't get the calculated total. You get raw line items. So now you have to replicate that business logic in your ETL job. You need to understand the discount rules, the shipping calculation, the tax logic, the promotional credits. And when we change any of that logic (which we will, because business requirements change constantly), you need to update your ETL job to match.
Now we have two implementations of the same business logic. They will drift. I've never seen a scenario where they don't drift. And when they drift, the data warehouse shows different numbers than the application, and someone has to reconcile the differences while the CFO is asking why the revenue reports don't match.
This is why APIs matter. The API returns the calculated order total. You don't need to understand how we calculated it. You don't need to replicate our logic. You just read the value we give you. When the calculation changes, the API still returns the correct value, and your ETL job doesn't need to change at all.
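By contrast, the contract hands over answers, not ingredients. A sketch of what the response shape might look like (field names hypothetical):

// Computed by the application, consumed as-is by the ETL.
public record OrderTotals(
    string OrderId,
    decimal Subtotal,
    decimal Discount,
    decimal Shipping,
    decimal Tax,
    decimal GrandTotal);

When the discount rules change, GrandTotal changes with them. The shape of the contract doesn't.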
Events Capture Intent, Not Just State
Here's the most important argument for consuming events from the business logic layer instead of reading the database directly: events tell you why something happened, not just what changed.
When your ETL reads directly from the database, whether through SQL queries or Change Data Capture, you only see state transitions. A number changed. A row was inserted. A status field went from "pending" to "complete." That's it. You have no idea why.
Let me give you a concrete example. Your database shows that inventory for Product 47 went from 100 units to 105 units. That's a fact. But why did it go up? Was it because:
- We received a shipment from our supplier?
- A customer returned 5 units?
- We corrected a data entry error?
- We transferred inventory from another warehouse?
If you're reading from the database, you can't tell. The inventory level is now 105. That's all you know.
But if your application publishes business events, you get the context:
// Shipment received
public record ShipmentReceived(
    string ProductId,
    int Quantity,
    string SupplierId,
    string PurchaseOrderNumber,
    DateTimeOffset ReceivedAt);

// Order returned
public record OrderReturned(
    string OrderId,
    string ProductId,
    int Quantity,
    string ReturnReason,
    DateTimeOffset ReturnedAt);

// Inventory correction
public record InventoryAdjusted(
    string ProductId,
    int PreviousQuantity,
    int NewQuantity,
    string AdjustmentReason,
    string AdjustedBy,
    DateTimeOffset AdjustedAt);
These events don't just tell you that inventory changed. They tell you why it changed, who or what caused the change, and all the relevant business context around that change.
This has massive implications for analytics. When your data warehouse consumes these events instead of database rows, you can answer questions that would be impossible otherwise:
- Attribution: Which suppliers contribute most to our inventory? Which return reasons are most common? Which warehouses do we transfer inventory from most frequently?
- Root cause analysis: When revenue dropped, was it because shipments were delayed, return rates spiked, or we had a wave of inventory adjustments?
- Business intelligence: The pattern of shipments vs. returns tells you about product quality, supplier reliability, and customer satisfaction in ways that raw inventory numbers never could.
The database can't give you this. A CRUD operation only captures the result: inventory_level = 105. The business event captures the entire story: ShipmentReceived { supplier: "ABC Corp", quantity: 5, purchaseOrder: "PO-12345" }.
In event-sourced systems, this becomes even more critical. Many of my applications use Event Sourcing as a persistence pattern. In these systems, the events are the source of truth. The database is just a projection, a materialized view derived from the event stream. When your ETL reads the database in an event-sourced system, you're reading a derivative, not the primary record. You're getting second-hand information that's already lost the business context. The events have all the detail. The database just has the summary.
Even in traditional CRUD systems, the business logic layer is where intent lives. The database stores the outcome. The application code determines why that outcome happened. When you bypass the application layer and read straight from the database, you're throwing away the most valuable information you could capture.
What About Change Data Capture?
CDC is a pattern where you track changes to database tables (inserts, updates, deletes) and stream those changes to downstream systems. Tools like Debezium can tail the database transaction log and publish change events to Kafka. From there, your ETL can consume those events and update the warehouse incrementally.
This is a legitimate pattern, and it solves some of the performance problems with API-based extraction. You're not polling for changes or doing full table scans. You're getting a stream of deltas that you can process efficiently.
But CDC doesn't solve the coupling problem or the business logic problem. And it definitely doesn't solve the intent problem.
CDC gives you raw database changes: a row was inserted, a column was updated, a record was deleted. That's pure state transition. You're still reading the what, not the why. You see that inventory_level changed from 100 to 105, but you have no idea whether that was a shipment, a return, a correction, or a transfer. All the business context is gone.
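To see exactly what's lost, here's roughly the shape of a CDC change event, sketched as a C# record in the style of the events above. The Before/After/Op envelope mirrors what tools like Debezium emit; the types are illustrative:

// What CDC hands you: row state before and after, plus an operation code.
// No supplier, no return reason, no actor. The intent is already gone.
public record InventoryRow(string ProductId, int QuantityOnHand);

public record InventoryRowChanged(
    InventoryRow? Before,    // ProductId "47", QuantityOnHand 100
    InventoryRow? After,     // ProductId "47", QuantityOnHand 105
    string Op,               // "c" = insert, "u" = update, "d" = delete
    long TimestampMs);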
You're still coupled to the schema. And you're still bypassing any transformations or calculations that happen in the application layer.
Where CDC does earn its keep is as plumbing rather than as the interface: use it to feed an event broker inside your boundary, then expose the resulting events to consumers as a versioned contract layer. That gives you CDC's performance, real-time deltas without API calls, with contract boundaries. It's a very different thing from letting the ETL read the change stream directly.
Database Views as a Middle Ground
Database views are the tempting middle ground: the application team creates views that present a stable, contract-like interface to the data, and the ETL reads from the views instead of the raw tables. If the schema changes, the application team updates the views to maintain compatibility.
This is better than unrestricted direct access, but it's still not a contract in the way I mean it. Views are defined in SQL. Not in a format that supports versioning, documentation, or tooling. No OpenAPI spec. No version negotiation.
Views also live in the database, which means they're part of your schema migration process. Every time you deploy a schema change, you need to test that all the views still work. This creates deployment coupling between your application and your ETL, which is exactly what we're trying to avoid.
Can views work? Sure, in small systems with stable schemas and tight coordination between teams. But as the system grows, you'll end up wishing you had invested in proper contracts from the start.
Solving the Performance Problem
The biggest legitimate concern from ETL engineers is performance. APIs designed for interactive use don't scale to bulk data extraction. So how do you solve that without giving up on contracts?
Build bulk extraction endpoints. Create API endpoints specifically designed for ETL. These endpoints support pagination with reasonable page sizes (thousands of records, not dozens), filtering by date ranges, and bulk formats like CSV or Parquet. They're still contracts: versioned, documented, and maintained. But they're optimized for the ETL use case.
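Here's a minimal sketch of what such an endpoint can look like using ASP.NET Core minimal APIs. The route, page sizes, OrderSummary shape, and OrderStore are all hypothetical; the point is keyset pagination with large pages behind a versioned route:

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Keyset pagination: the client passes the last id it saw and gets the next
// batch. No OFFSET scans, so page 1,000 costs the same as page 1.
app.MapGet("/api/v1/bulk/orders", (long? afterId, int? pageSize) =>
{
    var size = Math.Clamp(pageSize ?? 5_000, 1, 10_000); // thousands, not dozens
    var batch = OrderStore.After(afterId ?? 0, size);    // hypothetical data source
    return Results.Ok(new
    {
        items = batch,
        nextAfterId = batch.Count == size ? batch[^1].OrderId : (long?)null
    });
});

app.Run();

public record OrderSummary(long OrderId, string CustomerId, decimal TotalAmount);

public static class OrderStore
{
    // Stand-in for "WHERE order_id > @afterId ORDER BY order_id LIMIT @count".
    public static IReadOnlyList<OrderSummary> After(long afterId, int count) => [];
}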
Expose event streams. Instead of an API, publish domain events to a message broker like Kafka or RabbitMQ. Your application publishes events whenever something interesting happens: order created, customer updated, payment processed. The ETL subscribes to those events and updates the warehouse incrementally.
This is my preferred approach because it solves both the performance problem and the intent problem. Events are contracts: they have schemas, they're versioned, and they decouple the producer from the consumer. But unlike CDC, which gives you raw state changes, business events give you rich, meaningful data about why things happened. The warehouse isn't just storing numbers. It's storing the full business context—which campaigns drove revenue, which suppliers delivered shipments, which return reasons customers reported. This makes your analytics exponentially more valuable than anything you could get from reading database rows.
Use outbox pattern for reliability. If you're publishing events, use the transactional outbox pattern to ensure that events are published reliably. Write events to an outbox table in the same transaction as your business logic, then have a separate process poll the outbox and publish to the message broker. This guarantees that every state change results in an event, even if the broker is temporarily unavailable.
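A minimal sketch of the write side, assuming EF Core; OrdersDbContext, Order, and OrderPlaced are hypothetical stand-ins. The relay that polls the outbox and publishes to the broker runs as a separate process:

using System.Text.Json;

public record OrderPlaced(string OrderId, decimal Total, DateTimeOffset PlacedAt);

public class OutboxMessage
{
    public Guid Id { get; set; } = Guid.NewGuid();
    public string Type { get; set; } = "";
    public string Payload { get; set; } = "";        // serialized event
    public DateTimeOffset OccurredAt { get; set; }
    public DateTimeOffset? PublishedAt { get; set; } // set by the relay once sent
}

public class OrderService(OrdersDbContext db)        // hypothetical EF Core DbContext
{
    public async Task PlaceOrderAsync(Order order)
    {
        db.Orders.Add(order);

        // Written in the same transaction as the business data: if the order
        // commits, the event row commits with it, so no state change is lost.
        db.Outbox.Add(new OutboxMessage
        {
            Type = nameof(OrderPlaced),
            Payload = JsonSerializer.Serialize(
                new OrderPlaced(order.Id, order.Total, DateTimeOffset.UtcNow)),
            OccurredAt = DateTimeOffset.UtcNow
        });

        await db.SaveChangesAsync(); // one transaction covers both writes
    }
}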
Provide batch exports on a schedule. If real-time extraction isn't critical, generate batch exports (CSV, Parquet, JSON) on a schedule and drop them in an S3 bucket or SFTP location. The ETL picks them up and processes them. The export format is your contract. You can version it, document it, and test it just like an API.
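A sketch of that export side, reusing the hypothetical OrderTotals shape from earlier; the file name and column layout are the versioned contract:

using System.Globalization;

public static class OrderExport
{
    // The header row and the v1 in the file name are what you document and test.
    public static async Task WriteAsync(IEnumerable<OrderTotals> orders, string directory)
    {
        var path = Path.Combine(directory, $"orders_v1_{DateTime.UtcNow:yyyyMMdd}.csv");

        await using var writer = new StreamWriter(path);
        await writer.WriteLineAsync("order_id,grand_total");
        foreach (var o in orders)
            await writer.WriteLineAsync(
                $"{o.OrderId},{o.GrandTotal.ToString(CultureInfo.InvariantCulture)}");
    }
}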
All of these approaches solve the performance problem while maintaining contract boundaries. Yes, they require more work upfront than just handing over database credentials. But they save you from months of coordination headaches later.
When Direct Database Access Might Be Acceptable
Direct database access for ETL is defensible in a few specific scenarios. Here they are.
Stable, mature systems with infrequent changes. If your application schema hasn't changed in two years and isn't likely to change in the next two, the coupling risk is lower, and a contract layer might not justify its cost. But be honest about whether your system really fits this description. Most don't. Regulated data and pure reference tables might qualify; anything with a realistic chance of evolution doesn't, and contracts remain the safe bet.
Read-only replicas with explicit ownership. If the ETL reads from a dedicated read replica that's explicitly maintained as a stable interface, and the application team owns keeping that replica compatible, that's closer to a contract. The replica becomes the boundary. But you need clear ownership and SLAs, or you're back to the invisible coupling problem.
Prototyping and temporary integrations. If you're building a one-time analysis or a prototype dashboard, direct database access might be fine. The key word is temporary. If it's still running six months later and people are making business decisions based on it, it's not temporary anymore, and you need to convert it to a proper contract.
Small teams with tight coordination. If the application developers and the ETL engineers are the same people, or they sit together and coordinate closely, the coordination cost is lower. You can get away with direct access because the communication overhead is minimal. But as soon as teams split or people leave, you lose this advantage.
In every other scenario (which is most of them), contracts are worth the investment.
Making the Case to ETL Engineers
Here's how I make the case to ETL engineers:
Contracts protect you from us. When we refactor our database schema, we'll update the contract layer to maintain compatibility. You won't wake up to a broken ETL job because we renamed a table. The contract is your stability guarantee.
Contracts reduce your domain knowledge burden. You don't need to understand our business logic, our validation rules, or our data transformation pipeline. You consume the contract, which gives you the data in the form you need. We handle the complexity on our side.
Contracts enable parallel evolution. With a versioned contract, we can release a new version without breaking your existing integration. You can migrate to the new version on your schedule, not ours. No more coordinating deployments across teams.
Contracts are testable. We can write automated tests that verify the contract hasn't changed unexpectedly. You can write tests that verify your ETL handles the contract correctly. Both sides have confidence that changes won't break the integration.
Contracts are documented. An OpenAPI spec or a message schema is self-documenting. You know exactly what fields are available, what their types are, and what they mean. No more reverse-engineering our database schema and hoping you got it right.
The short version: contracts make your job easier in the long run, even if they feel like friction in the short term.
My Current Stance
I still hold the line on direct database access for ETL. Every time I've compromised on this, I've regretted it six months later when we're in a meeting trying to figure out why the warehouse data doesn't match the application and nobody can remember all the places that are reading the database directly.
The ETL engineers aren't wrong about the performance and complexity challenges. But the solution isn't to abandon contracts. The solution is to build contracts that actually work for ETL: bulk extraction endpoints, event streams, batch exports. Make the contract layer good enough that direct database access stops looking attractive.
This requires upfront investment. You need to design those endpoints, publish those events, and maintain that infrastructure. But the alternative is invisible coupling that will cost you far more in the long run.
I'm still learning better ways to make this work. What I've found is that the best contracts come from collaboration between the service team and the warehouse team. The service team knows the domain and the events. The warehouse team knows what questions they need to answer. When both sides define the contract together, you get something that actually serves the analytics use case without forcing the warehouse to reverse-engineer intent from raw data.
But the core principle doesn't change: if you want data from my bounded context, go through the contract. It protects both of us, and it's the only way I've found to keep distributed systems maintainable as they grow.