
It's five grand a day to miss our S3 exit

ksec 186 points world.hey.com
ifightcrime
Cloud has always been more expensive. I remember being quoted 250k/month for bandwidth when I was paying 15k with Rackspace 10+ years ago. You're paying for convenience and speed. The math stops working when you grow to a certain point.

You can mitigate this to some extent by making some key architecture + vendor decisions upfront when first building… or just consider that some day you’ll need to do things like this. It’s not a novel problem.

vidarh
It's horrifyingly hard to convince people of this, though, even when you can present them with actual numbers.

A lot of people have convinced themselves that cloud is cheap, to the point that they don't even do a cursory investigation.

A lot of those don't even do the bare minimum to reduce hosting costs within the cloud they choose, or choose one of the cheaper clouds (AWS is absolutely extortionate for anything that requires a significant amount of outbound bandwidth), or put caching/CDNs in front (you can trivially slash your AWS egress costs dramatically).
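To put rough numbers on that last point, a minimal sketch; the prices and the hit ratio are assumptions for illustration, not quotes:

    # Back-of-envelope: AWS egress with vs. without a cache/CDN in front.
    AWS_EGRESS_PER_GB = 0.09   # rough AWS list price, first pricing tier
    CDN_EGRESS_PER_GB = 0.01   # assumed price at a cheaper CDN/provider
    monthly_egress_gb = 200_000
    cache_hit_ratio = 0.95     # assumed fraction served from the cache

    direct = monthly_egress_gb * AWS_EGRESS_PER_GB
    cached = (monthly_egress_gb * cache_hit_ratio * CDN_EGRESS_PER_GB
              + monthly_egress_gb * (1 - cache_hit_ratio) * AWS_EGRESS_PER_GB)
    print(f"direct ${direct:,.0f}/mo vs cached ${cached:,.0f}/mo")
    # direct $18,000/mo vs cached $2,800/mo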

Most of my consultancy work is on driving cost efficiencies in the cloud, and I can usually safely guarantee the fee will pay for itself within months because people don't pick even the lowest-hanging fruit.

mbreese
I think one business argument for cloud is capital expenses vs operational expenses. If you’re (over) paying for cloud resources vs an in house option (or colo), those are numbers that are a straight expense. When you own hardware, those are on your books until they depreciate off. For some businesses, that can make sense.

Now, a good accountant probably wouldn’t care one way or the other. Debits and credits balance either way. And spending more still means less profit in the long term, no matter how it looks on the books. But, in addition to the flexibility, that was what I always thought of as the main cloud benefit. It’s the same with leasing vs buying cars/computers/etc…

tialaramex
Cloud made sense for the startup I worked for previously. If you are a startup then a $1M per year expense makes much more sense than a $5M up front purchase with 5-10 years of life - in five years you might be billionaires or you might be bankrupt and until then the Cloud was better.
bitmasher9
Cloud also makes sense with certain traffic patterns where the peak requirement is a huge outlier but critical to satisfy.
mbreese
Or where locality is critical. Like if you have a game that hits peak traffic at different times throughout the day in different regions. So, a company may not want to own hardware in multiple regions, when they would only be at peak usage for a few hours.
vidarh
The variation in peak loads tends to be far smaller for most people than what they imagine, but indeed it can sometimes be cheaper. The window needs to be very short, though, to outweigh the large cost differential. And you don't need to buy - you can rent.
bitmasher9
I don't think the window length is the key metric. It's the ratio between peak traffic and normal traffic.
vidarh
Yes, but in practice it's even rarer for people to have loads that have multiple spikes a day, to the point it's a rounding error you can mostly ignore outside of very unusual niches. Usually the day-night cycle entirely dominates in terms of traffic variations and is already long enough to make auto-scaling unviable in terms of cost.

You're right that if the longest window is short enough to make autoscaling financially beneficial over managed hosting, then you also need to make sure that you don't regularly have other spikes that can tip things back to being unprofitable.

dilyevsky
The classic case for this is Intuit which presumably needs most of its compute only two months a year. Not that many companies in the same boat though
sgarland
If your monthly cloud bill is < $100K, you don’t need $5MM worth of hardware. That cloud spend equates to, at best, a couple of beefy servers, which (modulo AI cards) could almost certainly be had for <= $50K/ea. So for $200K, you could have a two-zone setup.

Where it does make sense in the short-term for this scenario is the experience and knowledge necessary to reliably run your own servers. If you don’t have that, you may not want to invest the time and effort to do so. But on pure cost, unless your bill is on the order of a few thousand per month, cloud will never win. It can’t; they have to make money.

dilyevsky
Like sibling is saying it's not $5M upfront but on the order of 1 year cloud spend for large enough accounts. There are also such things as leases and loans.

One justifiable excuse is you simply don't know how much hardware you will need to buy if you're hitting hockey-stick growth. That is, until you realize you can also go hybrid...

vidarh
Or you could have rented or leased-to-own. There's hardly ever any need to actually purchase outright to get prices far below equivalent capacity in clouds. In fact, in 25 years, only one of the colo'd server setups I've worked on had any hardware purchased up-front in it.
winkeltripel
It feels like if you're spending 1M in cloud per year, the hardware and colo will probably cost 1M to buy.
vidarh
But that too is based on people not knowing the alternatives, as renting managed servers can be close to a wash vs. leasing hardware for a colo (often to the point that the relative cost of real estate near your preferred managed hosting provider vs. a colo with good staff access might be what makes one or the other cheaper). Buying outright can be cheaper but isn't necessary.

None of the colo'd setups I've worked on bar one used purchased servers - it's all been leased. But the majority of non-cloud workloads I've worked on have not even been leased, but rented.

NorwegianDude
You don't have to buy the hardware. It's very common to rent it.
tialaramex
Periodically management says we shouldn't have a DC, just put everything in the cloud.

OK says HPC, here's the quote for replacing one of the (currently three) supercomputers with a cloud service. Oh dear, that's bigger than your entire IT budget isn't it? So I guess we do need the DC for housing the supercomputers.

If we'd done that once I'd feel like well management weren't to know, but it recurs with about a 3-5 year periodicity. The perception seems to be "Cloud exists, therefore it must be cheaper, because if it wasn't cheaper why would it exist?" which reminds me of how people persuade themselves the $50 "genuine Apple" part must be better because if it wasn't better than this $15 part why would Apple charge $50 for it? Because you are a sucker is why.

vidarh
Yeah, I used to be asked to price out a move to AWS every year at one position. After several years Hetzner finally got cheaper than operating our own colos, but basically only because we were in London and London real estate is expensive, and so colo space is accordingly expensive, while Hetzner's DC space is dirt cheap.

AWS, however, remained 2x-3x as expensive, with the devops time factored in.

The perception seems to be "Cloud exists, therefore it must be cheaper, because if it wasn't cheaper why would it exist?"

People are also blithely unaware that large customers get significant discounts, and so I regularly have to explain that BigCo X being hosted in AWS means at most that it is cost-effective for them, because their spend means they're getting a significant discount over even the highest-volume published pricing, and my clients usually are nowhere close to spending enough to get those discounts.

whstl
It's just people conflating popularity with <every positive attribute>.

If <service> is popular, it must also be cheap, beautiful, well documented, have every feature that exists and make you popular with your friends.

I once had a Product Manager try to start an argument with me: "Explain to me how it is possible that the service we pay 25k a month for doesn't have <feature>. You don't know what you are saying." It just didn't do what he wanted, and getting angry with them over the phone didn't magically make the feature appear.

azinman2
Apple may have markup, but the part is for sure more likely to be higher quality: https://www.cultofmac.com/news/apple-thunderbolt-4-cable-com...
JimDabell
Same goes for Apple power adapters:

https://news.ycombinator.com/item?id=28053398

throwaway894345
I think management is just prone to wanting to believe the grass is greener on the other side. If you were already a cloud org with negotiated pricing and cost optimization, management would ask about building a data center, and you would show them how much you would need to expand your IT staff to acquire the skills to operate the new data center, never mind the upfront cost.
koliber
Regarding apple parts, I recently replaced a broken screen on a MacBook pro with an OEM part. I can’t get the color to look right. Not to mention the one vertical row where pixels look off (not dead, but not normal either). The guy at the shop said I would not notice. I am now kicking myself for not going with the real thing.
magicalhippo
If we'd done that once I'd feel like well management weren't to know, but it recurs with about a 3-5 year periodicity.

So basically every time management changes[1]?

[1] https://maexecsearch.com/average-c-suite-tenure-and-other-im...

tialaramex
I work in HE, so there's obviously much more turnover in senior management than in the rest of the hierarchy, but that's individual turnover; there's no need for Jim†, who arrived last week, to task HPC with gathering a quote when six of his subordinates and colleagues already know from last time that this is a waste of time.

† Name changed to protect individuals but also because frankly I don't care very much who is currently doing these roles, there'll be others.

smugma
What’s HE? High energy physics?
tialaramex
Higher Education. A University. So, senior management are much the same as anywhere (although maybe with at least enough sense to realise that the mission is now different) but many other people are there because of the mission. Researchers, teachers, even if what you actually do is marketing, there's a very different sense of purpose behind that than if you were selling garden furniture.
diggan
A lot of people have convinced themselves that cloud is cheap

I've noticed this too, freelancing/consulting around in companies. I'm not sure where this idea even comes from, because when cloud first started making the news, the reasoning went something like "We're OK paying more since it's flexible, so we can scale up/down quickly", and that made sense. But somehow today a bunch of people (even engineers) are under the belief that cloud somehow is cheaper than the alternatives. That never made sense to me, even when you take into account hiring people specifically for running the infrastructure, unless you're a one-person team or have to aggressively scale up/down during a normal day.

bostik
I can provide an example where cloud, despite its vastly higher unit costs, makes sense. Analytics in high finance (note: not HFT). Disclosure: my employer provides systems for that.

A fair number of our clients routinely spin up workloads that are CPU bound on hundreds-to-thousands of nodes. These workloads can be EXTREMELY spiky, with a baseload for routine background jobs needing maybe 3-4 worker nodes, but with peak uses generating demand for something like 2k nodes, saturating all cores.

These peak uses also tend to be relatively time sensitive, to the point where having to wait two extra minutes for a result has real business impact. So our systems spin up capacity as needed, and once the load subsides, terminate unused nodes. After all, new ones can be brought up at will. When the peak loads are high (& short) enough, and the baseload low enough, the elastic nature of cloud systems has merit.

I would note that these are the types of clients who will happily absorb the cross-zone networking costs to ensure they have highly available, cross-zone failover scenarios covered. (Eg. have you ever done the math on just how much a busy cross-zone Kafka cluster generates in zonal egress costs?) They will still crunch the numbers to ensure that their transient workload pools have sufficient minimum capacity to service small calculations without pre-warm delay, while only running at high(er) capacity when actually needed.
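For anyone who hasn't done that Kafka math, a back-of-envelope sketch; the ~$0.02/GB billed across both sides of a cross-AZ transfer and the traffic figures are assumptions:

    # Rough cross-AZ transfer bill for a 3-AZ Kafka cluster with RF=3.
    COST_PER_GB = 0.02            # ~$0.01/GB charged on each side, assumed
    ingest_gb_per_day = 5_000     # produced data before replication, assumed

    produce = ingest_gb_per_day * 2 / 3   # leader is usually in another AZ
    replicate = ingest_gb_per_day * 2     # two follower copies cross AZs
    consume = ingest_gb_per_day * 2 / 3   # without fetch-from-follower

    monthly = (produce + replicate + consume) * COST_PER_GB * 30
    print(f"~${monthly:,.0f}/month in zonal transfer alone")  # ~$10,000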

Optimising for availability of live CPU seconds can be a ... fascinating problem space.

vidarh
There are absolutely plenty of spaces where this is true and cloud makes sense either because it's actually cost effective, or because the cost doesn't matter.

Most people aren't in those situations, though, but I think a lot of them think they're much closer to your scenario than the much more boring situation they're actually in.

vidarh
I think it's because people think their workloads are extremely spiky, and so assume they will spin up/down loads enough to save money, and that has translated into cloud being perceived as cheap.

But devs rarely pay attention to metrics. I've had clients with expensive Datadog setups where it was blatantly obvious that nobody had ever dug into the performance data, because if they did they'd have noticed that key metrics were simply not fed to it.

If they did pay attention, most of them would realise that their autoscaling rarely kicks in all that much, if at all. Often because it's poorly tuned, but also because most businesses see small enough daily cycles.

Factor in that the cost difference between instances and managed servers is quite significant, and you need spikes much shorter in duration than most businesses' day/night variation to save money.
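A toy model of that trade-off; the 3x on-demand premium, the 30% off-peak baseline, and the hourly rates are all assumptions:

    # Flat managed server sized for peak vs. cloud autoscaling, per day.
    managed_hourly = 1.0    # peak-sized managed server, always on (assumed)
    cloud_hourly   = 3.0    # same peak capacity as on-demand cloud (assumed)
    baseline_frac  = 0.3    # off-peak capacity as a fraction of peak

    def cloud_cost(peak_hours):
        return (peak_hours * cloud_hourly
                + (24 - peak_hours) * cloud_hourly * baseline_frac)

    for h in (1, 2, 4, 8):
        print(f"{h}h peak: cloud {cloud_cost(h):.1f} vs managed 24.0")
    # Crossover is around one hour of peak per day with these numbers;
    # anything resembling a normal day/night cycle favours the flat server.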

It can make sense to be able to spin up more capacity quickly, but then people need to consider that 1) a lot of managed hosting providers have hardware standing by and can automatically provision it for you rapidly too - unless you insist on only using your own purchased servers in a colo, you can get additional capacity quickly, 2) a lot of managed hosting providers also offer cloud instances, so you can mix and match, 3) worst case, you can spin up cloud instances elsewhere and tie them into your network via a VPN.

Some offer the full range from colo via managed servers to cloud instances in the same datacentres.

Once you prep for a hybrid setup, incidentally, cloud becomes even less competitive, because suddenly you can risk pushing the load factor on your own/managed servers much closer to the wire, knowing you can spin up cloud instances as a fallback. As a result, the cost per request for managed servers drops significantly.

I also blame a lot of this on the business side shielding engineering from seeing budgets and costs. I've been in quite senior positions in a number of companies where the CEO or CFO was flabbergasted when I asked for basic costings of staff and infra, because I saw it as essential in planning out architecture. Engineers who aren't used to seeing cost as part of their domain will never have a good picture of costs.

dagw
I've noticed this too, freelancing/consulting around in companies. I'm not sure where this idea even comes from

Internal company accounting can be weird and lead to unintuitive local optima. At companies I've worked at, what was objectively true was that cloud was often much cheaper than what the IT department would internally bill our department/project for the equivalent service.

SteveNuts
paying more since it's flexible, so we can scale up/down quickly

I’ve heard this argument too and I think I’ve seen exactly one workload where it actually made sense and was tuned properly and worked reliably.

graemep
For smaller businesses it seems to be the safe option because it's what everyone does.

I have even had it suggested that it might make selling a business or attracting investors harder if you used your own servers (not at the scale of having your own datacentre, just rented servers - smaller businesses still).

Another thing that comes up is that it might be more expensive, but it's a small fraction of operational expenses so no one really cares.

whstl
For smaller businesses it's often "the only thing Joe knew when he was building it".
j45
You have a great point about finding cost efficiencies - there was a time cloud was cheaper.

Maybe it's an understanding that doesn't change because the decision makers were non-technical people (when finance oversees IT despite not understanding it).

Virtualizing and then sharing a dedicated server as a VPS was a big step forward.

Only, hardware kept getting cheaper and faster, and so did the internet.

vidarh
when finance oversees IT despite not understanding it

... and when IT often doesn't even get to see the spend, and/or isn't expected to.

I've had clients where only finance had permissions to get at the billing reports, and engineering only ever saw the billing data when finance were sufficiently shocked by a bill to ask them to dig into it - at which point they cared for long enough to get finance off their backs, and then stopped caring again.

ksec
Partly because AWS gives out a lot of free credits to startups, and basically allows them to grow without planning any infrastructure. VCs who are invested in Amazon also want to push the cloud narrative. Startups don't want to deal with servers, and want massive scale for when they think the website, and later an app, will go viral.

That was in the late 00s and early 10s. PHP, Python, Ruby and even Java were slow. Every single language and framework has had massive performance improvements in the past 15 to 20 years - anywhere from 2x for Java to 3-10x for Ruby.

Back then a server maxed out at 6-8 Xeon cores, compared to 192 cores today. Every core is at least 2-3x faster per clock; with higher clock speeds we are talking about a 100x difference. And where I/O used to be on HDDs, SSDs are easily 1000x faster, so what used to be time spent waiting for I/O is no longer an issue. The aggregate difference, with software improvements added in, could be 300x to 500x.

What would have needed 500 2U servers in 2010, you could now do in one.
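Multiplying those rough factors out (all order-of-magnitude guesses, per the above):

    # Order-of-magnitude multiply-out of the claims above.
    core_count_gain = 192 / 8   # 2010-era server vs. today
    per_core_gain   = 2.5       # IPC plus clock speed, roughly 2-3x
    software_gain   = 3         # language/framework gains, 2-10x

    print(f"~{core_count_gain * per_core_gain * software_gain:.0f}x "
          "compute per box")    # ~180x
    # Swap HDDs for NVMe on I/O-bound work and the aggregate 300-500x
    # figure stops sounding crazy.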

Modern web developers are so abstracted from hardware that I don't think many realise how big the hardware improvements have been. I remember someone posting that before 2016 Basecamp had dozens of racks before moving to the cloud. Now they have grown a lot bigger with Hey, and they are doing it all with 8 racks and room to spare.

AWS, on the other hand, is trying to move more workloads to ARM Graviton, where they have a cost advantage. Given Amazon's stock price is now dependent on AWS, I don't think they will lower their prices by much in the future. And we desperately need some competition in that area.

jinjin2
Yes. We saved ridiculous amounts of money (and made it a lot faster) by moving our analytics workloads from Snowflake to a few bare-metal nodes running Exasol. But it took months to convince management even though we had clear numbers showing the sheer magnitude of the cost reduction. They had drunk the cloud kool-aid, and were adamant that it would be cheaper, numbers be damned.
dangus
It’s more than mere “convenience.” You’re also paying to avoid hiring a bunch of employees to physically visit data centers around the globe.

And if you're not doing that, you are hiring a bare-metal server provider that is still taking a portion of the money you'd be paying AWS.

Even if you don’t need to physically visit data centers thanks to your server management tools, the difference in the level of control you have between cloud and bare metal servers is large. You’re paying to enable workflows that have better automation and virtual networking capabilities.

I recently stood up an entire infrastructure in multiple global locations at once and the only reason I was able to do it in days instead of weeks or months was because of the APIs that Amazon provides that I can leverage with infrastructure automation tooling.
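The comment doesn't say which tooling was used, but for flavour, a minimal boto3 sketch of stamping the same instance out across regions; the AMI ID, instance type, and tag are placeholders, not real values:

    # Hypothetical sketch: one stack stamped out across several regions.
    import boto3

    REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]

    def provision(region):
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType="m5.large",
            MinCount=1,
            MaxCount=1,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "stack", "Value": "global-demo"}],
            }],
        )
        return resp["Instances"][0]["InstanceId"]

    for region in REGIONS:
        print(region, provision(region))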

Once you are buying AWS reservations and avoiding their most expensive specialized managed products the price difference isn’t really worth trying to recover for many types of businesses. It’s probably worth it for Hey since they are providing a basic email service to consumers who aren’t paying a whole lot. But they still need something that’s “set it and forget it” which is why they are buying a storage solution that already comes with an S3 compatible API. So then I have to ask why they don’t save even more money and just buy Supermicro servers and install their own software? We all know why: because Amazon’s APIs are where the value is.

There is a lot of profit margin in software and usually your business is best spending their effort working on their core product rather than keeping the lights on, even for large companies. Plus, large companies get the largest discounts from cloud providers which makes data centers even less appealing.

“Convenience” isn’t just convenience, it’s also the flexibility to tear it all down and instantly stop spend. If I launch a product and it fails I just turn it off and it’s gone. Not so if I have my own data center and now I’ve got excess capacity.

luckylion
I agree, but I don't think you're in the majority. I don't think most cloud-customers are utilizing all of those additional things that a big cloud provider offers.

How many are actually multi-region? How many actually do massive up/down-scaling on short notice? How many actually use many of those dozens to hundreds of services? How many actually use those complex permissions?

My experience tells me there are some, but there are more who treat AWS/GCP/Azure like a VPS hoster that's 5-10x more expensive than other hosters. They are not multi-region, they don't do scaling, they go down entirely whenever the AZ has some issues, etc. The most they do is maybe use RDS instead of installing mysql/pgsql themselves.

dangus
A lot more than you’re giving them credit for.

This idea that their basic users go down entirely when the AZ has some issues is ridiculous; a standard autoscaling group and load balancer basically force you to be multi-AZ. Very much unlike a VPS.

Using RDS instead of self-installing SQL eliminates the need for an entire full time role for DB admin. So that’s kind of a big deal despite it being a “basic” use case.

A lot of services, like ECS and Elastic Beanstalk, make it so that you can wait longer to hire operations people, and when you do, they can migrate to more scalable solutions without having to do a major migration to some other provider or build a custom self-hosted solution. If you outgrow a VPS you have to do a major migration.

And if you take a look at the maturity and usefulness of the Terraform providers, SDKs, and other similar integrations from VPS and bare-metal providers, they are very basic compared to boto3 and the AWS Terraform provider.

I struggle to replicate the level of automation I can achieve with these cloud tools on my own homelab with Proxmox.

bigfatkitten
Using RDS instead of self-installing SQL eliminates the need for an entire full time role for DB admin.

No it doesn't. The value in a skilled DB admin is not in keeping the DB up and running, because no special skills are required to do that; the DB admin is an expert in performance. They add considerable value in ensuring you get the most bang for your buck from your infrastructure.

A popular modern alternative to this of course is to throw more money at RDS until your performance problems go away.

dangus
Yes it does.

While you’re not wrong about DB admins being important for performance optimizations, RDS stops you from having an inexperienced administrator lose data in stupid ways.

I know because I used to be that stupid person. You don’t want to trust your company’s data to a generalist that you told to spin up a database they’ve never configured before (me) and hope they got good answers when they googled how to set up backups/snapshots/replication.

sgarland
Amen. How this lie continues to be perpetuated as gospel is beyond me.

I can look at any company’s RDBMS who doesn’t have a full-time DB[A,RE] on staff and find ten things wrong very quickly. Duplicate indices, useless indices, suboptimal column types, bad or completely absent tuning, poor query performance…

It's only when a company hits the top end of vertical scaling that they think, "maybe we should hire someone," and the problem then is that some changes are extremely painful at that scale, and they don't want to hear it.

j45
Multi AZ is as much planning as anything else.

IaaS (Proxmox) is a different layer than PaaS as we know.

The same orchestration tools (Terraform) can orchestrate Proxmox or other hypervisors just fine. Discounted licenses for VMware are readily available on eBay if that is preferred.

Proxmox has built-in node mirroring between multiple servers, it just works after it's connected.

scarface_74
I can't speak too much for small companies. But there are a lot of large enterprises, smaller businesses, and government agencies that use more AWS services than just compute + storage + web services, and that do need the elasticity, etc.

For instance, I was surprised how large the market was for Amazon Connect - Amazon’s hosted call centers. It’s one of the Amazon services I have some experience in and I still get recruiters contacting me for those jobs even though I don’t really emphasize that specialty.

My experience is from 7 years of working with AWS. First at a startup with a lot of complex ETL and used a lot of services. But the spend wasn’t that great.

My next 5 years were split between working at AWS (Professional Services) and two years at a third-party consulting company (full time), mostly as an implementation lead.

Even though my specialty is "cloud native application development" and I avoid migrations like the plague, most of the money in cloud consulting is in large companies deciding to move to the cloud because they decided that the redundancy, lower maintenance overhead, and other higher-level services were worth it.

bigfatkitten
How many are actually multi-region?

The fact half the internet seems to fall over whenever us-east-1 has a hiccup is quite telling.

j45
This might be a little incomplete.

It's trivial to get equipment into a datacenter where it is attended to on your behalf if you wish.

You can place your own equipment in a datacenter to manage yourself (dedicated servers).

You can have varying amounts of the hardware up to the software layer managed for you as a managed server, where others on site will do certain tasks.

Both of these can still be cheaper than cloud (which provides convenience, at a large markup, by making often open-source tools easy to administer from a web browser) plus paying someone to manage the cloud.

Going global all at once can still be done with a hybrid-cloud or cloud-agnostic setup (not being tied to one cloud only, for fallback and independence).

peeters
The reality is when you get to another certain point (larger than the point you describe) you start negotiating directly with those cloud providers and bypass their standard pricing models entirely.

It's the time in between that's the most awkward: when the potential savings are big enough that hiring an engineering team to internalize infrastructure would give a good return (were current pricing to stay), but you're not so big that just threatening to leave will cause the provider to offer you low-margin pricing.

All I'd say is don't assume you're getting the best price you can get. Engineers are often terrible negotiators, we'd rather spend months solving a problem than have an awkward conversation. Before you commit to leaving, take that leverage into a conversation with your cloud sales rep.

jonatron
At what sort of scale can you do that? $1M, $10M, $100M, $1B?
tecleandor
In my experience with GCP, go through a Google partner (that will aggregate multiple clients to get discounts) and you'll be able to get commitment discounts at $500K/year or even less. But don't save too much money during your commitment period: if you don't spend your commitment, you'll pay for it anyway, and you might even lose some discounts.

Also, one trick to inflate your commitment expenses is asking your SaaS providers if it's possible to pay them through AWS or GCP marketplaces: it often counts against your commitment minimum expense, so not everything has to be instances and storage.

dilyevsky
You can commit right there in the console - no need to work with a partner unless you want a "flex" commit, where the saving is less. Even with a 3y commit it's still nowhere near cheap compared to buying servers and renting colo space, especially for bandwidth and storage.
tecleandor
It's not the same commitment. When doing a commitment through a partner, you're doing an expense commitment (let's say, 600k in a year) on ALL your expenses. Well, except for the Google Maps API it seems :P. So it's not tied to a specific product or type of instance, as with the typical commitment, but to your whole GCP billing.

From this, you get a wide range of discounts on a bunch of products, not just instances. And I think those discounts go on top of some of the other discounts you regularly have, but I'm not sure and I'd have to check our billing.

peeters
So obviously this is an extreme, but I worked for a company that had long dismissed third party cloud providers as too expensive (customers would be routing all of their network traffic through our data centers, so obviously the bandwidth costs would just be too dang high). Then that company got purchased by a certain mega corporation who then negotiated an exclusive deal with GCP, and the math flipped. It was now far too expensive to run our own set of datacenters. Google was willing to take such a low margin on bandwidth that it made no sense not to.

So in this case, hundreds of billions. But the principle stands at lower company sizes, just with different numbers and amounts of leverage.

sokoloff
hundreds of billions

That doesn't seem right. GCP's entire run rate is around $50B/yr.

peeters
Sorry I was giving the company's size, not their spend.
sokoloff
I don’t remember if our first enterprise agreement was at $1M or $2M, but it was low and in that neighborhood [but also 10 years ago, well before cloud was the default and had growth baked into it].

Cloud providers are looking for a multi-year term and a commitment to growth as much as, or more than, an exact spend level now.

TrueDuality
Even with the discounts of volume pricing, cloud prices are still quite inflated unless you need to inherit specific controls like the P&E ones from FedRAMP High/GovCloud. The catch there is lock-in technologies that may require you to re-develop large swaths of your applications if you're heavily reliant on cloud-native tools.

Even going multi-region, hiring dedicated 24/7 data center staff, and purchasing your own hardware amortizes out pretty quickly and can give you a serious competitive advantage in pricing against others. This is especially true if you are a large consumer of bandwidth.

aleph_minus_one
Engineers are often terrible negotiators, we'd rather spend months solving a problem than have an awkward conversation.

My experience is the opposite: lots of software developers ("engineers") would love to do "brutal" negotiations to fight against the "choking" done by the cloud vendors.

The reason why you commonly don't let software developers do these negotiations is thus the complete opposite: they apply (for the mentioned reasons) an ultra-hardball negotiation style (lacking all the diplomatic and business customs of politeness) that leads to vast lands of burnt soil. Thus, many (company) customers of the cloud providers fear that this hardball negotiation style destroys any future business relationship with the respective (and perhaps for reputation reasons a lot of other) cloud service provider(s).

dbbk
But the article states they negotiated.
peeters
This was more a response to the comment I replied to, that cloud is always more expensive. And saying it more for everyone, not OP.

It's almost always less expensive at the start, which is super important for the early stages of a company (your capital costs are basically zero when choosing say AWS).

Then after you're established, it's still cheaper when considering opportunity costs (minor improvements in margin aren't usually the thing that will 10x a company's value, and adding headcount has a real cost).

But then your uniqueness as a company will come into play and there will be some outsized expense that seems obscene for the value you get. For the article writer, it was S3, for the OP, it's bandwidth. For me it's lambdas (and bizarrely, cloud watch alarms). That's when you need to have a hard look and negotiate. Sometimes the standard pricing model really doesn't consider how you're using a certain service, after all it's configured to optimize revenue in the general case. That doesn't mean the provider isn't going to be willing to take a much lower margin on that service if you explain why the pricing model is an issue for you.

bigfatkitten
Even starting out, with used/refurbed hardware you can put a lot of compute power into a colocation facility for very little money.
diggan
The reality is when you get to another certain point (larger than the point you describe) you start negotiating directly with those cloud providers and bypass their standard pricing models entirely.

And even if you do, you still end up with pretty horrible pricing, still paying per GB of "premium" traffic for some outrageously stupid reason, instead of going the route of unmetered connections and actually planning your infrastructure.

ksec
Sounds like the trap for Middle Class.
selfhoster
It's the time in between that's the most awkward.

That's an odd way to describe hemorrhaging money.

binarymax
Even without your own rack or colo, the math with AWS stops working as soon as you no longer fit in the free tier, since providers like Hetzner are 40% cheaper.
gizmo
S3 is designed for 99.999999999% durability. Hetzner's Volume storage is just replication between 3 different physical servers.

In terms of durability that's a universe apart.
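A crude way to see the gap; the failure rate and repair window below are assumptions, and real systems have correlated failures, so treat this as an optimistic toy model:

    # Toy annual data-loss estimate for naive 3-way replication.
    afr = 0.02              # assumed annual failure rate per disk
    t = 1.0 / 365           # assumed 1-day window to re-replicate

    # One of three copies fails, then the remaining two fail inside
    # the repair window before re-replication completes.
    p_loss = (3 * afr) * (2 * afr * t) * (afr * t)
    print(f"{p_loss:.1e}")  # ~3.6e-10, i.e. nine-ish 9s per object
    # S3's eleven 9s come from wide erasure coding across many
    # facilities, not from triplication like this.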

SteveNuts
S3 is beyond impressive, but how many workloads truly need that? I’ve never had a single instance of data loss on a NetApp or Pure array.
jessekv
Truly need, I don't know. But customers will request (and pay for) the 9's.
SteveNuts
I suppose it's like how someone who's already made up their mind to buy a Lamborghini never questions whether they really need an 800HP engine.
sebazzz
On the other hand you have transient failures in the cloud (at least on Azure - this behavior is even documented) so does that count towards the 99.99999%?
jwiz
That sounds like availability, not durability.
t0mas88
Around that certain point you can also talk to AWS or GCP and get very significant discounts. I'm surprised 37signals and AWS didn't find a number that worked for both.

I've seen a few of these deals with other vendors up close, the difference with public pricing is huge if you spend millions per year.

bigfatkitten
DHH has said previously that they already have a very good deal when compared with list price. But AWS still couldn't come close to on prem costs.
dilyevsky
I worked/consulted for several companies who had multimillion-per-year cloud commits, sometimes with different clouds, and those discounts are not competitive with on-prem at all
j45
If it takes talking to them to get discounts, might as well look at all the options and get the real discount of not being on the cloud.
djha-skin
It's not a novel problem but it _is_ a relatively novel (bad) economic environment. We've been in "let the good times roll" mode for longer than ten years; 2009-2011 was a different world. Many ops professionals are younger than that and have gone their entire careers without doing anything on premise.

I remember trying to convince some very talented but newly minted ops professionals -- my colleagues -- to go on prem for cost. This was last year. They were scared. They didn't know how that would work or what safety guarantees there would be. They had a point, because the org I was at then didn't have any on prem presence, since they were such a young organization that they started in the cloud during "the good times". They always hired younger engineers for cost, so nearly no one in the org even knew how to do on prem infra. Switching then would have been a mistake for that org, even though cloud costs (even with large commit agreements) were north of six figures a month.

tomrod
What do you mean "for cost" in your comment? For cost savings / frugal purposes? Or using something like a sweetheart deal with a PEO?
hodgesrm
The math stops working when you grow to a certain point.

That point is different for every business. For most of them it depends on how big cloud is in your COGS (cost of goods sold) which affects gross margins, which in turn is one of the most meaningful measures of company financial health. Depending on the nature of your business and the amount of revenue you collect in sales, many companies will never reach the point where there's measurable payback from repatriating. Others may reach that point, but it's a lower priority than other things like opening up new markets.

Many commenters seem to hold very doctrinaire opinions on this topic, when it's mostly basic P&L math.

jstummbillig
I find it intuitively absolutely bizarre that Cloud does not outright win at any scale. In my mind everything about it seems more optimizable with more scale. Obviously I am missing something, but all Cloud pricing looks so significantly more expensive than I feel it should in a healthy and mature market.
rvz
$1.5 million/year

That is excessive and it's already $4K a day.

Lots of teams really underestimate cloud costs since there is an assumption that the hundreds of millions they are raising will give them enough runway to survive a few years despite losing money for years.

Even scaling would be somewhat of an issue depending on the tech stack. Imagine the cost of running standard Java micro-services and the "solution" was to "spin up hundreds of more nodes". The worst that I have seen was a bank proudly having up to 8,000 - 10,000 separate micro-services.

Just imagine the daily cost of that. Unjustifiable.

But of course the AWS cloud consultants would be happy to shill you their offerings at "cheap" prices, but in reality the pricing is designed for you to accumulate millions in costs as you scale on the tiniest amount of usage, even for testing.

So before you build the software, one must think about the cost of scaling if it becomes widely used, rather than taking the easy approach of just spinning up more nodes and piling on costs, acting as if the capital to solve the problem will always be there. You can only do that for so long until you can't.

whstl
I remember that at a previous company it somehow leaked that the AWS cost was 50% of the entire developer payroll.

There was nowhere near the same volume of data as Basecamp/Hey, nor was there much processing power needed. It was purely bad engineering accumulated over 10 years.

lpapez
I was once contracted to work on a project where the GCP bill for Postgres was $60k per month - this was basically my YEARLY rate at that time, just for managed Postgres.

After some time I was quite familiar with their stack and had gathered considerable domain experience. This led to an idea of how to halve the database load (and the cost would presumably fall by a similar percentage), which I wanted to use as leverage during contract renegotiation.

I boldly offered to work for free to halve their database load, in exchange for being paid half the money this optimization would save over the course of one year. This would basically triple my pay, and they would still save money.

They declined, and I moved to a better opportunity.

Last I heard, they had to pay a team of 4 new consultants for a year to implement the same idea I had. Without the domain knowledge, the consultants couldn't progress as fast as I suspect I could have (my estimate was 2 months of work).

I know it's very petty, but I regret revealing too many implementation details of the idea during the pitch and allowing the company to contract other consultants to see it done.

vidarh
I've made similar pitches to clients many times, and one thing I've learned is that ironically the problem is promising the actual saving, vs. offering a much smaller saving.

The challenge is that people don't believe you when you tell them they can save that much, no matter how much evidence you prepare. I'm starting a sales effort for my agency right now, and one of the things we've worked on is promising less than what we determine we can deliver after reviewing the client's costs, and raising our prices, because it's ironically easier to close on a promise to deliver 20%-30% savings at a relatively high cost than on a promise to deliver 50%+ with little effort.

skrebbel
If you built up that domain knowledge while being paid top dollar per hour by the same company, then I understand their reluctance to go along with your offer. It feels a little bit extortionate to be honest. I wouldn't go along with it either, not because it's a bad deal in isolation, but because it sets a bad precedent. It basically tells every employee/contractor that if they know a way to add a lot of measurable value, they can use that as a bargaining chip to 3x their pay. This also discourages trying to add any value that isn't as easily expressed in dollars (which is the case for many important things, such as product quality improvements).

I think part of the expectation when contracting somewhere long-term (or just being an employee, for that matter) is that the amount of value you add per hour worked increases sharply over time, and faster than your fee. In other words, initially you're overpaid wrt your value-add, and then that corrects itself over time as you figure out what the company is all about.

sgarland
My current and last jobs had monthly RDBMS bills in excess of $1 million. It is staggering. We could buy two fully-loaded 42U racks in separate DCs and be net positive after a few months. I've done the math, in great detail.

No go. “It’s hard to hire for that skill set.” Is it $9 million/year hard?! You already have a team lead – me. This shit is not that hard; people will figure it out, I promise.

j45
It's not petty. It's the profit margin average solutions get to make over expertise with individual tools, or groups of tools.
whstl
> I know it's very petty

No it isn't.

Aperocky
Is 50% that bad? If instead you hired engineers to maintain access to some kind of file storage on the internet, would it cost more or less?

It would be alarming if it were 500% of staff salary, but at 50% that just seems like the cost of outsourcing to a standard that likely won't be achieved in-house.

whstl
Considering it was about 100 developers, it was horrible.

The two major problems were:

1. The volume of data itself was not that big (I had a backup on my laptop for reproductions), but it was just too heavy for even the biggest things in AWS. Downtimes were very frequent. This is mostly due to decisions from 10 years ago.

2. Teams constantly busy putting out fires but still getting only 1-2% salary increases due to lack of new features.

EDIT: Since people like those war stories. The major cause for the performance issues was that each request from an internal user would sometimes trigger hundreds of queries to the database. Or worse: some GET requests would also perform gigantic writes to the Double-Entry Accounting system. It was very risky and very slow.

This was mostly due to over-reliance on abstractions that were too deep. Nobody knew which joins to make in the DB, or was too afraid, so they would instead call 5 or 6 classes and join manually, causing O(N^2) issues.

To give a dimension of how stupid it was: one specific optimization I worked on changed the rendering time of a certain table from 25 seconds to 2 milliseconds. It was nothing magic.
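A hypothetical reconstruction of that class of bug, using sqlite3 just to keep the sketch runnable; the table names are invented:

    # N+1 app-side "joins" vs. one SQL join.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INT);
        INSERT INTO customers VALUES (1, 'acme');
        INSERT INTO invoices VALUES (10, 1), (11, 1);
    """)

    def slow(invoice_ids):
        # The antipattern: one round-trip per row, hidden in layers of OOP.
        out = []
        for iid in invoice_ids:
            (cust_id,) = db.execute(
                "SELECT customer_id FROM invoices WHERE id = ?", (iid,)).fetchone()
            (name,) = db.execute(
                "SELECT name FROM customers WHERE id = ?", (cust_id,)).fetchone()
            out.append((iid, name))
        return out

    def fast(invoice_ids):
        # The fix: let the database do the join in one round-trip.
        marks = ",".join("?" * len(invoice_ids))
        return db.execute(
            "SELECT i.id, c.name FROM invoices i"
            " JOIN customers c ON c.id = i.customer_id"
            f" WHERE i.id IN ({marks})", invoice_ids).fetchall()

    print(slow([10, 11]), fast([10, 11]))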

I'm glad I left.

Aperocky
That does sound like an engineering problem more than anything.

On an off note, migrating to NoSQL might not have a lot of on-paper benefit, but it does force developers to design their tables and queries in a way that prevents this kind of query hell. Which might be worth it on its own.

sgarland
How does NoSQL (and which flavor are you referring to?) enforce that? RDBMS enforces it in that if you don’t do it correctly, you get referential integrity violations and performance issues. You’d think that would be enough to motivate devs to learn it, but no, let’s use more JSON columns!
Aperocky
It's the human aspect of engineering: you can't join 15 different tables just by running a 200-line SQL command in NoSQL, and this manual burden forces a rethinking of what the acceptable design is.

Relational DBs are great, but just like Java design patterns, they get abused because they can be. People are happy doing stuff like that because it's low resistance and low effort, with consequences building up over the long term.

whstl
In my example the abuse was on the OOP part, not in the relational database part.

Database joins were fine, they just weren’t being made in the database itself, due to absurd amounts of abstraction.

I don't disagree that rethinking the problem with NoSQL would solve it (or maybe even would have prevented it), but on the other hand I bet having 5 layers of OOP could also mess up a perfect NoSQL design.

vidarh
My experience from offering devops services on retainer to a number of clients is that the ones that host in cloud environments spend more money on me for similar scale setups than the ones that host on managed setups.

And even if you don't want the hassle of storing the data yourself, there are many far cheaper outsourced options than S3.

iLoveOncall
Even scaling would be somewhat of an issue depending on the tech stack. Imagine the cost of running standard Java micro-services and the "solution" was to "spin up hundreds of more nodes". The worst that I have seen was a bank proudly having up to 8,000 - 10,000 separate micro-services. Just imagine the daily cost of that.

I'm not going to preach for thousands of micro-services necessarily, but they also make scaling easier and cheaper.

Not every service in your application receives the same load, and being able to scale up only the 20% of Lambdas that receive 80% of the traffic will result in massive savings too.

owebmaster
but they also make scaling easier and cheaper.

Easier is arguable but cheaper is not, for sure.

Nextgrid
the hundreds of millions they are raising will give them enough runway to survive a few years despite losing money for years.

It's more that the decision makers at every stage are not incentivized to care, or at least, were not during the ZIRP period. This is slowly changing, as evidenced by more and more talks of "cloud exits".

Software engineers are encouraged by the job market to fill their resume with buzzwords and overengineer their solutions.

Engineering managers are encouraged by the job market to increase their headcount, so complicated solutions requiring lots of engineers actually play in their favor.

CTOs are encouraged by the job and VC funding market to make it look like their company is doing groundbreaking things and solving complex problems, so overengineering again plays in their favor. The fact these problems are self-inflicted doesn't matter, because everyone is playing the same game and has no reason to call them out for it.

Cloud providers reward companies/CTOs for behaving that way by extending invites to their conferences, which gives the people involved networking opportunities and "free" exposure for the company to hire more engineers to fuel the dumpster fire even more.

no_wizard
Testing in particular is where AWS is the most egregious.

You don't get any testing services baked into the pricing; you're paying production pricing for setting up / tearing down environments for testing. They have little to nothing in the way of running emulators locally for their services, and it leads to other solutions of varying quality.

It’s outrageous and something i will always hold against AWS forever. Not to mention their CDK is for shit. Their APIs are terrible and poorly documented. I don’t know why anyone chooses them still other than they seem to have the “nobody got fired for choosing AWS” effect.

Azure is really good at providing emulators for lots of their core services for local testing for instance. Firebase is too, though I can’t vouch for the wider GCP ecosystem

4ndrewl
This is where your choice of which cloud services to use comes into play - Containerised web apps with Postgres on RDS? Simple to move off onto self hosting _if_ you can prove a business model that needs scaling. All-in on some proprietary services - less so.
Nextgrid
a bank proudly having up to 8,000 - 10,000 separate micro-services

Monzo in the UK?

comrade1234
Years ago (10+), when I was deciding between AWS and hosting our own hardware in a colo for my company, I found a spreadsheet where you plug in your hardware and hosting costs plus the current Amazon pricing for their various products, and it would tell you how long it takes to break even on the initial investment.

It was always just above three years to break even. After we decided to self-host I still kept tracking the prices in the spreadsheet and as hardware costs fluctuated Amazon adjusted their prices to match. I figured someone at Amazon must have had the same spreadsheet I was working with.
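That spreadsheet logic is simple to sketch; every number here is a placeholder, not a real quote:

    # Break-even on self-hosting, spreadsheet-style; all inputs invented.
    hardware_upfront = 150_000   # servers, disks, network gear
    colo_monthly     = 4_000     # rack space, power, bandwidth
    ops_monthly      = 8_000     # marginal ops time, not whole heads
    aws_monthly      = 25_000    # the equivalent cloud bill

    monthly_saving = aws_monthly - (colo_monthly + ops_monthly)
    print(f"break-even after {hardware_upfront / monthly_saving:.1f} months")
    # ~11.5 months with these inputs; per the comment above, plugging in
    # real Amazon prices kept landing just over 36.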

rco8786
I'm sure Amazon has mountains of data to arrive at that target number. But subjectively, it also feels about right in terms of juicing the most profit margin from customers without being obviously overpriced. It's a nice "sweet spot": just far enough out on the horizon that companies, especially SMBs, don't really account for it, or don't think the up-front cost/effort is currently worth it vs. putting that effort into revenue-generating activities.

Hardware cycle is probably about 3 years also.

Gigachad
Business needs change often enough that it’s hard to justify locking something in for three years vs paying a small amount extra for flexibility.
vidarh
Sounds way too long unless you use no extra services, and no outbound bandwidth to speak of.

AWS outbound bandwidth costs in particular are tens of times higher than what you can get elsewhere, to the point that when clients insist on S3, e.g. because they're worried about durability (which is a valid consideration), I usually ask if they'd be happy to put a hefty cache at a cheaper provider in front - if you use lots of bandwidth, it's not unusual for it to be cost effective to cache 100% of the dataset somewhere cheaper just to avoid AWS bandwidth charges.

spockz
Is that including or excluding write off? If you have to replace the hardware every 3 years then it would be equivalent. IIRC hardware is replaced every 3-5 years because otherwise it is out of support.
j45
Write-off is an accounting term, not operational.

Hardware like cars and laptops can continue to perform after they are written off, or even after the warranty.

The grade of hardware used is critical in servers.

Hyperscaling might mean commodity based servers. Hosting a large app does not mean using commodity component servers.

Hardware, when self hosting, does not need to be replaced every 3-5 years because it does not fail every 3-5 years. Depends on load and a bunch of factors.

Why?

We wouldn’t buy the cheap and disposable components a massive cloud or social media network might use to scale faster because they have a massive budget.

Besides, do providers really replace all their servers every 3-5 years? Hosting companies don’t seem to.

The cloud is many multiples more expensive than self hosting especially at scale. Hosting and cloud tools have brought down labour costs tremendously.

As for the hardware: with the extremely clean environments servers run in, plus much cleaner electricity, hardware runs much longer.

Purchase actual enterprise-grade servers (HP ProLiant, etc.) that a company would buy for themselves for maximum reliability (compared to the commodity-based ones clouds use), and they have so much reliability built in that they sometimes never die.

You can still buy used ProLiant servers many, many generations old and they hum along just fine. It seems bizarre, but it isn't.

Support is a few things: warranty on parts and software. Extended support options (which amount to a hardware warranty) are always available for a fee, and achievable on your own.

If your software is a hypervisor you will be mirrored.

If a server has an issue the affected machine moves the load elsewhere.

The server has hot swap equipment. Takes a few moments to swap components if needed.

If you are self hosting, you can buy a used server or two or three to have a backup and mirror and spare parts. It's like buying a few NUCs.

Hosting corporately can be done not just by buying, but by leasing too (meaning hardware swapping can happen). Add to this moving older equipment to less demanding tasks (if they even stay at load).

vidarh
Write-off is an accounting term, not operational.

That's the point. I've just decommissioned 10 year old servers for a client. They were still working fine, but the system had finally been replaced.

If you're calculating break-even based on the rate at which you're writing off the accounting value of the servers, you'll end up with a far longer time to break-even than if you amortise the hardware cost over the projected actual lifetime of the hardware.

j45
Hmm. Depending on the jurisdiction (this seems to be common), equipment can be written off on a faster depreciation schedule than it is actually kept or used for.

In that way, writing off is often one part maximizing the depreciation schedule (to "write it off" as an asset in your business as quickly as possible), and another part how long the equipment is actually kept.

Insert stereotype of bean-counting and propellers-spinning.

This means it's perfectly possible to use equipment after it's been written off, and be in a position to re-purchase it when it fails.

Spreadsheets can be a disease this way, written for the single scenario they're evaluating and not for enough scenarios or forecasts.

We should assume multi-billion dollar clouds do not use single spreadsheets to understand how they make 5-10x (or higher) off the same server resources by selling them as individual API calls.

The markup on cloud services can be astronomically high. I've been around data centre hosting of your own bare metal servers, then virtualized servers, then going 1000% cloud, and have now realized it's gotten much easier to self-host, personally and professionally (with experience).

The assumptions about why one might use a cloud originate with the initial uptake of the cloud and remain anchored there, regardless of the changes and evolution since.

vidarh
Sure, but again, the accounting write-off period is entirely orthogonal to how you choose to calculate your break-even point, so the accounting write-off period is really a distraction. But the "default" of 3 years is often used without much thought when customers of the cloud providers evaluate pricing, and so for the cloud providers it makes sense to make themselves look plausibly competitive when customers look at the numbers that way.

A lot of the other markup is obscured by splitting things into multiple categories, such as costs per requests, separate pricing for bandwidth etc. A lot of clients I talk to don't understand the pricing of the services they run, and the developers usually both don't know and don't care.

sgarland
That’s an ops decision. The most common things to die are RAM and PSUs. The latter are redundant on every server I’ve ever seen. The former is dirt cheap (especially as the hardware ages), and extremely easy to replace.

I have 13 year old Dell R620s that have been running 24/7/365 for a few years in a suboptimal environment at this point (I mean, minus occasional restarts for kernel updates, brief maintenance periods, etc.). The only thing I’ve had to replace were RAM and a single PSU.

npalli
DHH's focus on getting massive savings on storage costs is a head-scratcher; usually, egress costs are the punitive part. Given LLMs are pretty good now, I modeled the break-even point of such a move from S3 (6PB) to a colo. Break-even time: 18 years(!!); even with an 8-year depreciation cycle you're still looking at 8 years to break even. This is not including falling S3 costs at Amazon. Truth be told, he is probably spending tens of millions on servers running Rails (SSR); he could probably cut that in half by moving to Golang or Java. This is peanuts in comparison.

  1. Initial costs of $2.5M to provision the hardware (disks, servers, enclosures, networking equipment, redundancy, software solutions, etc.)
  2. Facility OPEX: $50,000/month (power, connectivity, monitoring, etc.)
  3. Staffing and operations tools: $10,000/month
  4. Replacement cycle of 5 years, so assume $500,000/year ≈ $41K/month
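
A rough sketch of the arithmetic, assuming ~$0.021/GB-month for 6PB of S3 Standard and ignoring request and egress charges (which would favour the colo):

    // Break-even under the four inputs above vs an assumed S3 bill.
    const s3Monthly = 6_000_000 * 0.021;            // 6PB at $0.021/GB-month ≈ $126k (assumed)
    const colocMonthly = 50_000 + 10_000 + 41_000;  // facility + staffing + replacement ≈ $101k
    const upfront = 2_500_000;                      // initial hardware provisioning

    const savings = s3Monthly - colocMonthly;       // ≈ $25k/month
    console.log(`Break-even: ${(upfront / savings / 12).toFixed(1)} years`); // ≈ 8.3
    // Small changes to the opex lines swing this a lot -- which is exactly
    // what the replies below argue about.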
bauruine
2. Facility OPEX: $50,000/month (power, connectivity, monitoring, etc.)

Are you sure? It looks like they only have 1 rack (+1 in another facility for redundancy) and seem to have 40Gbit/s connectivity.

A full rack is in the range of $1k/month, connectivity around $600 per 10Gbit/s. I have no idea how much power they consume, but I doubt it's $40k+ per month for a storage workload. I would guess they're in the $10k range. These are only list prices I've seen in the wild, so take it with a grain of salt, but $50k seems VERY high.

npalli
1 rack is a recipe for failure; you need to split across sites. Even without that, consider:

  how many SSDs you need
  power usage for all those SSDs
  inter-site connectivity (you need to keep transferring data between the sites, otherwise customers are going to be very surprised)
  maintenance and software costs (at the colo level)
All these add up, closer to $50K than some $2K (LOL). The way you guys (below) are talking, this is not some home server that serves personal videos. It runs (some) business operations for thousands of small/medium companies.
bauruine
I talked about 2 racks in different datacenters for $10k total, not $2k. As you seem to like LLMs, here's the answer for a $50k colo:

For $50,000/month in the USA, you could likely secure:

    Multiple full cabinets (approximately 10-12 cabinets)
    Higher power allocation (50-60kW total)
    Extensive bandwidth packages with multiple high-capacity connections
That seems extensive on both space (10-12 full racks) and power (50-60kW).

I'm open to more details from you on why 50k is reasonable.

omnimus
The power must be way less, especially without spinning drives.
ahofmann
Your numbers are way off. They posted the initial cost of the storage, as well as the support costs.

The facility opex of 50k a month is just wrong for one rack.

Whatever LLM you used, it told you good-sounding bullshit. Like LLMs do.

bigfatkitten
50k is in the ballpark for a 200kW GPU rack, which they are quite obviously not running.
andrewaylett
I'd like to note that they're not swapping like-for-like here. The differences might not be material for the use-case, and AWS probably don't even offer a product that does exactly what the Pure Storage system does, but S3 has a lot more resiliency as well as other features.

That doesn't mean they're wrong to move, it means you need to be careful to make sure that you pay for what you need, and try to avoid paying extra if all it gives you is stuff you don't want. I value the extra functionality, so I'm not moving my data off S3.

diggan
I wish people were as careful when thinking about adopting S3 as they seemingly are when others move away from S3.

Somewhere along the line, people started defaulting to "at least 3 nodes for the backend" and "cloud services for all infrastructure" even if the product they're building hasn't yet found product-market fit.

Sure, if you know for a fact that your traffic will go up and down more than 50% during a normal day, go for something that scales up and down quickly. But for most other use cases, the extra cost of cloud doesn't really make much financial sense, unless you're a fat VC-funded startup cat.

gavinray
That reminds me of a friend who ran a SaaS business off a Hetzner VPS.

It was a Node.js app he deployed via SSH and ran under a systemd job.

It used directories of JSON files as a database, and the business logic was handled by a single endpoint that took JSON-RPC payloads with different action types and metadata.

The app scaled to ~10,000 daily users like this.
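
Roughly this shape, for flavour -- a hypothetical sketch of the pattern, not his actual code:

    // One endpoint, JSON-RPC-style dispatch on an "action" field, and a
    // directory of JSON files standing in for a database. Hypothetical
    // sketch -- real code would validate `id` and the action name.
    import { createServer } from "node:http";
    import { mkdir, readFile, writeFile } from "node:fs/promises";

    const DATA_DIR = "./data";
    await mkdir(DATA_DIR, { recursive: true });

    const actions: Record<string, (params: any) => Promise<unknown>> = {
      async saveNote({ id, text }) {
        await writeFile(`${DATA_DIR}/${id}.json`, JSON.stringify({ id, text }));
        return { ok: true };
      },
      async getNote({ id }) {
        return JSON.parse(await readFile(`${DATA_DIR}/${id}.json`, "utf8"));
      },
    };

    createServer((req, res) => {
      let body = "";
      req.on("data", (chunk) => (body += chunk));
      req.on("end", async () => {
        try {
          const { action, params } = JSON.parse(body);
          const result = await actions[action](params); // throws on unknown action
          res.writeHead(200, { "Content-Type": "application/json" });
          res.end(JSON.stringify({ result }));
        } catch (err) {
          res.writeHead(400);
          res.end(JSON.stringify({ error: String(err) }));
        }
      });
    }).listen(3000);

With files on local disk and no network hop to a database, it's not shocking that something like this handles 10k daily users.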

stavros
I run multiple SaaSes, some popular, some not so much, off a single $20/mo Hetzner VPS with Dokku on it. It works great, and I've never needed to worry about anything.

Meanwhile, I see friends working on MVPs with 1-2 non-paying customers who already have costs in the thousands of dollars a month, but "it's fine because we got free money for a year". Yes, but that means that your company now has an expiration date of a year.

kristianp
Why use Dokku, and not, say, docker-compose?
stavros
I like the automated deployments. These days I'd probably use Harbormaster, but I'm biased because I wrote it.
andrewaylett
Hear, hear.

My personal AWS bill is roughly $10/month, all for S3. We're not talking millions here :). Personal compute is a mix of OVH and on-prem.

Work is an entirely different kettle of fish, at an entirely different scale, and primarily runs compute on spot: https://aws.amazon.com/blogs/aws/capacity-optimized-spot-ins...

Being able to scale down, rather than needing to pay for peak capacity, genuinely does save us large amounts of money. But it's a capability that we needed to build out, not something that happened by magic. And it does require that our services are big enough to scale for load, not just for redundancy.

roncesvalles
Scale down-to/up-from zero is a helluva drug. You're describing poor engineering within the cloud itself. If you're using cloud, use it for the right things. Container-as-a-service is awesome. SaaS databases are nifty. Build up your startup on the cloud, then save costs by migrating off.
Axsuul
Cloud is expensive but hardware failures are at least handled gracefully. With coloc you'd have some serious downtime. That means you'd need to get to a certain level of redundancy in order to have coloc make sense.

I'd love to move to coloc for my SaaS but it doesn't feel as resilient. Please correct me if I'm wrong as I'd love to move off the cloud.

nijave
Not sure cloud is necessarily more resilient--imo it's less resilient. On the other hand, it's fully automated with robust APIs, so there are easy tools to mitigate failures, like node/machine sets (scale sets, scaling groups, auto-scaling groups, whatever the provider calls them).

You could use an orchestration solution to help handle automatic failover. There's a handful of container-based options from heavy duty Kubernetes to Docker Swarm and Nomad.

Containers are nice since you can bypass most of the host management; you only need basic security patching and installation of your container runtime. There are also k8s distros like OpenShift that make k8s setup easier if you go that route.

Axsuul
Yep, I use orchestration (Nomad), but you would still need hardware redundancy. For example, the database server is currently a single point of failure. In the cloud, if there's a hardware failure, it will simply go down and come back up on a new instance. In coloc, you'd need to have the data center debug and replace the hardware, which means extended downtime.
NorwegianDude
When using colocation, nothing stops you from storing the database data externally to the server running the database, like some cloud services do. But doing so, cloud or not, has a serious downside: greatly increased latency.
Axsuul
Kind of defeats the purpose of colocation if you're not also running the database on your own server.
nijave
Patroni will manage a PG cluster and auto fail over. I've heard of Stolon as well. If you're on k8s, there's a couple good operators that will handle this

I believe paid PG vendors like EnterpriseDB and maybe Crunchy have their own tools

sgarland
You would not need to have extended downtime. Every major RDBMS that I’m aware of supports standby nodes, and if you want, a full active-active cluster (not recommended, personally).

The downtime is only as long as whatever you set your health-check monitoring interval to.
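
Back-of-envelope, with hypothetical settings:

    // Failover downtime ≈ detection time + promotion time, not hours of
    // hands-on hardware repair (all numbers hypothetical).
    const checkIntervalSec = 5;      // health-check period
    const failuresBeforeAction = 3;  // consecutive failures before failing over
    const promotionSec = 20;         // time to promote the standby

    const worstCaseSec = checkIntervalSec * failuresBeforeAction + promotionSec;
    console.log(`Worst-case downtime ≈ ${worstCaseSec}s`); // 35s with these settings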

cullenking
Enterprise server gear is pretty reliable, and you build your infra to be fully redundant. In our setup, no single machine failure will take us offline. I have 13 machines in a rack running a >10mm ARR business, and haven't had any significant hardware failures. We have had occasional drive failures, but everything is RAID1 at a minimum, so they are a non-issue.

We just replaced our top of rack firewall/proxies that were 11 years old and working just fine. We did it for power and reliability concerns, not because there was a problem. App servers get upgraded more often, but that's because of density and performance improvements.

What does cause a service blip fairly regularly is having a single upstream ISP. I will have a second ISP into our rack shortly, which means that whole class of short outage will go away. It's really the only weak spot we've observed. That being said, we are in a nice datacenter that is a critical hub in the Pacific Northwest. I'm sure a budget datacenter would have a different class of reliability problems that I'm not familiar with.

But again, an occasional 15m outage is really not a big deal business-wise. Unless you are running a banking service or something, no one cares when something is down for 15m. Heck, all my banks regularly have "maintenance" outages that are unpredictable. I promise, no one really cares about five nines of reliability in the strong majority of services.

Axsuul
Sounds great. Yep, what I mean is you will need to make your systems fully redundant before considering coloc, if your business depends on reliability and uptime. That usually requires the business to reach a certain scale first.
sgarland
Sure, but making something redundant is not really that difficult. HAProxy in front of N nodes across M racks, ideally in separate DCs, and then a floating IP in front of your HAProxies. Set up a hot standby for your DB.

I used to joke that my homelab almost had better reliability than any company I’d been at, save for my ISP’s spotty availability. Now that I have a failover WAN, it literally is more reliable. In the five years of running a rack, I’ve had precisely one catastrophic hardware failure (mobo died on a Supermicro). Even then, I had a standby node, so it was more of an annoyance (the standby ran hotter and louder) than anything.

floren
Hot spares and remote hands will get you a lot.

And when you get down to it, AWS isn't actually that reliable. I thought EBS volumes had magic redundancy foo, but it turns out they can fail, and they fail in a less obvious way than a regular disk. AWS networking is constantly bouncing, and the virtual network adapters just sometimes stop working. They're also running old CPUs.

Depending on your workload, you may be able to pay off your new hardware with just a couple months' savings.

dilyevsky
This. With AFRs (annualized failure rates) as they are today, plus warranty options and remote hands, it's hardly as bad as most people seem to think, especially if their recollection of working with colocation is from 20 years ago.
Axsuul
Got any recommended providers?
dilyevsky
Equinix and Digital Realty are gold standard especially if you need comprehensive remote hands but $$$. CoreSite is also good and cheaper if you're in the US.
Hnrobert42
I wonder if the savings includes the cost of labor to maintain the physical servers, cabling, performance and security monitoring, etc. Not saying it doesn't, I just wonder.
znpy
I wonder if the savings includes the cost of labor to maintain the physical servers, cabling,

This appears to be a valid point, but it really isn't.

In one case you're paying sysadmins; in the other you're paying cloud engineers.

performance and security monitoring, etc. Not saying it doesn't, I just wonder.

You'd be paying for those anyway. AWS's mantra is that Amazon takes care of the security OF the cloud, but YOU (the client company) take care of the security IN the cloud. Same goes for performance etc.

exe34
From listening to people who work with AWS talking about their work, I don't get the impression they save any time - it's just different stuff. They're constantly doing things with permissions, setting up something or other, trying to work out why something can't find something else - it doesn't just magically solve all your problems. Presumably once you figure out everything for your 2000 customers you can scale to a billion with just your credit card info, but most people won't work on stuff that scales to billions of users.
vidarh
Indeed. I offer consultancy both for cloud-based clients and for ones using managed hosting like Hetzner or their own colo'ed servers, and the clients who host in the cloud routinely need more of my time on a per-server/instance basis.

Most of the time when people bring up these costs, they have no experience with modern server hardware and hosting. You can have the server shipped directly to a colo, you can file tickets with the provider to have it plugged into your power bar and your network, and you can connect with your IPMI client and set it to boot from your PXE/bootp/tftp server to get an install image.

With a well-configured setup, you have a one-off cost to set up your firewalls, wire up the switches and power bars, and set up a server for the rest to network-boot off; after that, bringing up your servers is near-automatic, and most management can be done remotely via IPMI or similar.

It's not the 90's any more.

Winsaucerer
Sounds like the real challenge is experience/skill, then? It's easier to find people who know how to use AWS/Azure/GCP than people who know either (a) your colo'ed setup, or (b) how to do a good, appropriate, modern colo setup.

(this is a question disguised as a statement, since I'm interested in your opinion)

vidarh
There is something to that, though these skills are fairly easy to buy. Lots of consultancies like mine offer management of these things on retainer. But a lot of developers have passing knowledge of AWS etc. but none of managed hosting, and might well prefer the cloud options for that reason.

Poorly managed old hosting setups also probably burned a lot of more experienced people. E.g., if you had to fill in stupid forms and request a server weeks ahead, odds are cloud equals freedom in your mind, even though a well-run infra setup could let you spin up container workloads and leave adding capacity as a background concern developers don't need to think about.

Personally, I recall well the time I had to call in to Yahoo's hardware review board, chaired by one of the founders (Filo) because the billing system I managed, which handled millions of dollars worth of transactions, needed a new database server - priced around $10k. There were at least a dozen senior people on that call.

If that was your experience of colo/managed servers/on-prem, it's not surprising if you value cloud services far above their cost for the sake of avoiding that bureaucracy.

I'm working on tooling that I hope will change that, by making compute and storage on cheaper managed providers vs. cloud providers fungible commodities, but it's a hard problem.

sebazzz
I don't get the impression they save any time - it's just different stuff. they're constantly doing things with permissions

Permissions in the cloud are a whole other beast, especially in Azure. You can easily spend a week figuring out various managed-identity issues.

As for time saving: I've noticed cloud engineers often build a huge contraption of Terraform and Ansible scripts, then bots and processes around them. That is then where the focus and time go, too.

And as with any software, it is never done. It can always be better.

Classic sysops reaches the state of "done, no need to touch it now" much faster.

Tractor8626
I'm pretty sure cabling is included in cost of renting space in datacenter.
DeathArrow
Using SSDs seems like a waste of money. I think a RAID-like structure with good old spinning hard disks is good enough. Does Amazon use only SSDs for S3?

I know Backblaze uses HDDs and some RAID-like addressing.

nijave
Not sure Amazon posts much about S3's architecture, but some years ago I think they had an article about using spinners. They had so many HDDs, though, that 100-250 IOPS × <the insane number of total drives> meant they still had insane throughput.

Not sure they're at the scale in the article to get good performance with HDDs, but someone with a storage-engineering background can correct me. NVMe seems like a waste, though. I imagine you can still get solid performance with SATA/SAS SSDs (250 to 50-80k IOPS is still a pretty huge leap).

nodesocket
Any idea what software they are using for the S3 data migration? rclone?
roxolotl
Yea I’d love to know the answer to this. Of course you can just download petabytes of data but I imagine there’s gotta be more to it than just rclone. Or maybe it is that simple.
Symbiote
I used rclone to migrate about 2PB of data between datacentres last year.

There were some ways to make it go a bit faster, found by reading the rclone manual, but otherwise no surprises.
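
The knobs are mostly concurrency; roughly this shape of invocation (placeholder remote names and illustrative values -- check the docs for your version):

    # --transfers: parallel file transfers
    # --checkers:  parallel existence/size checks
    # --fast-list: fewer, larger listing calls (helps against S3)
    rclone sync src:source-bucket dst:dest-bucket \
      --transfers 64 --checkers 128 --fast-list --progress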

I wasn't sure what the maximum transfer speed could have been, but as one side was still the production system, I didn't care to push to the limit. It was over 10Gb/s in any case.

nodesocket
Were they lots of smaller files or lots of large files? I imagine lots of small files would require a lot of concurrent threads to utilize a 10Gbps connection.
Symbiote
Lots of large files, I think most would have been over 50MB.

There was a folder of smaller files to move, around a million files each less than 100kB. That took longer, but not long enough that I bothered to try and speed it up.

ksec
I wonder how many racks there would be after the total cloud exit from Basecamp. It is a nice way to think about scale.

A whole business 1/5 of this size would have been running on 100s of racks 10 years ago. And today it could be fewer than 10.

mindcrash
David wrote about that earlier. They are running on just about 8 racks in each DC (and they use 2).

That might exclude the new Pure Storage setup, though.

https://world.hey.com/dhh/we-stand-to-save-7m-over-five-year...

ksec
I think the Pure Storage is at best half a rack.

I don't think it is 8 racks per DC; it is 4 racks per DC.

We currently spend about $60,000/month on eight dedicated racks between our two data centers

And they are only doing 64-core parts. We will have 256-core Zen 6 next year. And it seems that by next year, if they were willing to pay for density, they could have fitted everything inside one rack per DC.

Exciting times.

Edit: Actually if Intel were to push 18A on server it would make performance / density even better.

nijave
Actually if Intel were to push 18A on server it would make performance / density even better

Don't you end up needing pretty insane electrical capacity, not to mention cooling, for that kind of density?

RadiozRadioz
I thought you could pay Amazon to send you a van full of hard disks if you need to transfer that much data.
jarito
Pretty sure those services are only for transferring data into S3, not out.
byefruit
https://aws.amazon.com/snowball/pricing/ Snowball seems to support getting data out of S3, though you still end up paying extortionate egress charges.
dilyevsky
They waive egress for 60 days if you're leaving, at least when you migrate via the standard API.
hbogert
If vendors like Pure weren't so incredibly irritating, with their documentation behind a paywall, I'd actually recommend them to clients. What happens now is that a client already has Pure, and as a contractor I can't get access to the documentation. Seriously, what year is it?
otterley
If someone is hiding something, it means they have something to hide.
theptip
Would love to see their growth rates, but Basecamp seems like a canonical example of where renting makes the least sense.

They are bootstrapped and not hyperscaling, so they don’t need flexibility for rapid and unpredictable growth (VC backed startups do). And they have a strong engineering org, so they don’t benefit as much from buying reliability (large non-tech companies often do).

killer32
We’ve been running Rook-Ceph in production across multiple client environments. In one example, we built a setup with 8 refurbished Dell servers (128GB RAM, 8–14 JBOD disks each) over 10G networking. It supports geo-replication between sites and has been stable for over 2 years. Total hardware cost was under $100k.

Rook simplifies the operational overhead of Ceph quite a lot, especially in Kubernetes-native stacks. For teams with large data and HA requirements, it's been a solid on-prem alternative. SLA-backed managed services are also becoming more common, which helps reduce the operational burden even further.

otterley
Is Hey not able to take advantage of S3 intelligent tiering? It’s way cheaper than S3 standard for infrequently accessed objects. I wouldn’t be surprised if Hey’s access patterns lend themselves well to it.

(Yes, I’m aware that Hey’s decision to evacuate the cloud is a fait accompli, but I also can’t help but wonder if there are loads of potential savings that are being left out of the discussion.)

albert_e
Even if they move only part of the data before the deadline, they stop paying for it the moment they delete it from S3. (It doesn't work that way in reverse.)

They can start saving thousands of dollars even before the deadline if they are able to start moving as soon as their own infra is up and incrementally move and delete data from S3. If their data consumers can work with that, that is.

If part of the data does not need the full S3 guarantees of durability and availability, they could probably save more by using cheaper tiers while still on S3.

siscia
What are the stats of the new system?

Cost per gigabyte?

Latency curve on read? Latency curve on write?

Support for concurrency?

Max throughput?

I think the article is somewhat shallow; S3 is a very deep and flexible system. Of course, some people will just need its basic features, but others may be interested in the advanced features as well.

christina97
I very much appreciate the concrete numbers in this post!
calderwoodra
Their current rate is only ~20% cheaper ($4k/day), so missing the deadline isn't that big of a deal.
tgtweak
The frustrating part is they'll give you $1500/day back again if you let them know you're leaving... Very annoying tbh.
FireBeyond
But those big numbers always seem a bit abstract to me. The idea of paying $5,000/day, if we miss our departure date, is awfully concrete in comparison.

Says the guy who owns multiple hypercars, some of which are $3M+ (Aston Martin Valkyrie)...