home - hermiod

PipeHub: ETL on Github Actions

This is an explanation and rationale for PipeHub, which will be an open-source package for running ETL and reverse ETL pipelines on GitHub Actions.

The landscape

ETL tools are really expensive and have opaque pricing.

If you go to Fivetran's pricing page, you won't see a single number that describes their pricing. Instead, they link you to a page on their legal site with a consumption table.

Good luck trying to figure out how much it will cost to set up a few different syncs!

This is true for other software in this space as well.

All of these tools have the inevitable "Contact Sales" or "Get a demo" button. Nobody wants to talk to a salesperson from an ETL company.

Open source isn't a panacea

Even if you wanted to run workflows on your own infrastructure to save costs, you're still in a difficult position.

Airbyte is another product in this space that has an open source and cloud offering.

Here's a diagram from the Airbyte Self-managed Enterprise guide on how to deploy Airbyte in AWS:

Diagram of Airbyte self-managed enterprise infrastructure requirements on AWS

Then, it goes on to describe how you need:

an EKS cluster in 2 or more availability zones
a minimum of 6 nodes
an ingress ALB
an s3 bucket
a dedicated RDS database (with read replica)
an external secrets manager

Just to get started.

Most customers of ETL products aren't able or willing to do this; they pony up for the cloud version.

Why is ETL hard?

ETL platforms have to:

Write hundreds of connectors
Maintain hundreds of connectors
Build a workflow scheduling and management engine
Allow some form of version management
Allow SSO and role-based access control support
Alert / notify customers when things go wrong
Orchestrate tons of running containers, provision new machines
Allow customers to monitor and read logs
Support pricing / tracking software
Secrets management

These things are all hard and require years of engineering investment.

GitHub?

Enter an unlikely hero: GitHub Actions.

Workflow scheduling and management

You can schedule workflows up to every five minutes or trigger them with webhooks.

Version management

It's GitHub. You can manage your connections as code.

Here's an example of how powerful GitHub-managed connections can be. You can create a PR with a new connection workflow and run the workflow with test data. That's really hard to do with existing tooling! And you get that out of the box.

Another example: users who notice a bug and know how to fix it can fork the action code, fix the bugs, and run the action with their fork until the fix is upstreamed.

Allow SSO and role-based access control

You can assign certain users to be able to view or have merge rights on repositories. GitHub already has these features.

Alert / notify on failures in workflows

GitHub can also already alert or notify you when your action fails!

Orchestrate containers on VMs

GitHub Actions already has tons of options for VMs. These include runners with 2TB SSDs and 256GB of RAM. This is enough for the vast majority of ETL workloads using something like DuckDB.

They are on the expensive side, but this is nothing compared to the pricing you're getting from ETL providers.

If you want to minimize cost or run into limits with the GitHub Action runners you can host your own, which is a lot easier than self-hosting something like AirByte.

Allow customers to monitor jobs and see logs

GitHub Actions allows you to see logs and monitor the jobs.

Support pricing

GitHub has pricing breakdowns as well! You can easily see and control how much you're spending and schedule fewer jobs.

Secrets management

GitHub Actions also supports storing and managing secrets and environment variables already.

PipeHub

Bolded are the features we need that GitHub Actions already has:

Write hundreds of connectors
Maintain hundreds of connectors
Build a workflow scheduling and management engine
Allow some form of version management
Allow SSO and role-based access control support
Alert / notify customers when things go wrong
Orchestrate tons of running containers, provision new machines
Allow customers to monitor and read logs
Support pricing / tracking software
Secrets management

GitHub Actions is too perfect for ETL. All you need is the connector code.

This is the vision of PipeHub: run open source connectors inside custom GitHub Actions.

All of your ETL pipelines are version controlled YAML in .github/workflows. No provisioning infrastructure. No surprise bills. No awkward sales negotiation.

Risks

GitHub platform risk

If this gets too annoying for GitHub (platforms banning their action IPs, customers using too many runners, etc.), they might start banning usage of the action.

My bet is the incentives are aligned here: providers want to their users to get value from their data and GitHub (based on their pricing) makes a solid margin on GitHub Action runners.

This is certainly the biggest risk.

Nobody cares

It's possible that users of Fivetran and similar tools aren't frustrated enough with pricing. I doubt this is the case: there are tons of complaints on the r/dataengineering subreddit about Fivetran and related tools.

The procurement process for tools like Fivetran is also really tough. You have to spend a ton of time in meetings begging for budget only to get denied.

Lots of teams already have GitHub subscriptions, and can spin up a new repo with a GitHub Action. Many workflows will be less than a dollar (or essentially free).

Connector Atrophy

One risk is that the connector code breaks over time and doesn't get fixed by the open source community.

It's open source so they can report the issue (and fix it if they're savvy enough)
This is a problem with proprietary ETL solutions as well. They often break and the ETL providers have trouble keeping up.

We'll probably need integration tests with real customer data to make sure we know about issues ahead of time. That will be tough and take a lot of resources.

Burnout

If PipeHub is successful, there will be hundreds of connectors with thousands of users relying on them for mission critical workloads. That's a huge burden to take on for any open source project, especially since certain issues (outages in connected systems, breaking API or schema changes without warning) are often out of our control.

There are a couple tailwinds that help this:

It's open source for a technical audience. They can pitch in since they'll be getting immense value.
LLMs make writing integrations way way way easier. You can get to a first draft faster. LLMs also make writing tests faster, which will help ship with confidence.

Ultimately, open source is really hard and having a path to a sustainable business will be really key.

Gemini Flash 2.0 Live API fails the vibe check for me

I had a couple conversations with the new Multimodal Live API through the Google AI Studio.

I wanted to have a voice-to-voice conversation about an idea I was working through, and wanted Flash to walk me through what the potential pitfalls are.

Flash didn't give me the sophisticated kind of response that I expect from Claude or ChatGPT. I specifically asked Flash to compare and contrast different AWS instances and it wouldn't list them out (despite it being pretty obvious that it would have that knowledge).

Based on the style of the voice, it sounds like the voice is generated from text-to-speech as opposed to "natively multimodal" right now. Some of the responses were a little awkward.

I'm sure it's just early, and they'll make improvements. It's also unfair to compare the "Flash" line of models with Sonnet and 4o.

I really just want Advanced Voice Mode with GPTs! I want to be able to orchestrate some basic workflows with my voice. Coming soon, I'm sure.

My favorite new prompting trick: chat prefix completion

I've been working with the Deepseek API while working on my tool carrier.nvim.

It has a new beta feature that Anthropic also has that is really excellent, called chat prefix completion. You can just set what the assistant message response should start with.

This is really useful for contexts like code completion, where you want the model to always respond with just the code you want by starting the response with three backticks and a newline (essentially saying the assistant response should start a code block).

https://platform.deepseek.com/api-docs/chat_prefix_completion

This makes getting Just the Code back consistently really easy for my tool! I highly recommend you try this out in your applications, especially if you're asking for code completions.