It's five grand a day to miss our S3 exit
You can mitigate this to some extent by making some key architecture + vendor decisions upfront when first building… or just consider that some day you’ll need to do things like this. It’s not a novel problem.
A lot of people have convinced themselves that cloud is cheap, to the point that they don't even do a cursory investigation.
A lot of those don't even do the bare minimum to reduce hosting costs within the cloud they choose, or choose one of the cheaper clouds (AWS is absolutely extortionate for anything that requires a significant amount of outbound bandwidth), or put caching/CDNs in front (you can trivially slash your AWS egress costs dramatically).
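To put rough numbers on the egress point - a toy back-of-the-envelope in Python, where every per-GB price and the traffic volume are illustrative assumptions rather than quoted rates:

```python
# Toy egress comparison: serving everything from AWS vs. putting a cheaper
# CDN/cache in front. All figures are assumptions for illustration only.
monthly_egress_gb = 50_000        # ~50 TB/month out to the internet
aws_egress_per_gb = 0.09          # assumed AWS internet egress rate
cdn_egress_per_gb = 0.01          # assumed rate at a cheaper CDN/provider
cache_hit_ratio = 0.90            # fraction of requests served from the cache

all_aws = monthly_egress_gb * aws_egress_per_gb
# Misses are pulled from AWS and passed through the CDN, so they pay both rates.
with_cache = (monthly_egress_gb * (1 - cache_hit_ratio) * aws_egress_per_gb
              + monthly_egress_gb * cdn_egress_per_gb)

print(f"All egress via AWS:         ${all_aws:,.0f}/month")
print(f"90% served from the cache:  ${with_cache:,.0f}/month")
```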
Most of my consultancy work is on driving cloud cost efficiencies, and I can usually safely guarantee the fee will pay for itself within months because people don't pick even the lowest-hanging fruit.
Now, a good accountant probably wouldn’t care one way or the other. Debits and credits balance either way. And spending more still means less profit in the long term, no matter how it looks on the books. But, in addition to the flexibility, that was what I always thought of as the main cloud benefit. It’s the same with leasing vs buying cars/computers/etc…
You're right that if the longest window is short enough to make autoscaling financially beneficial over managed hosting, then you also need to make sure that you don't regularly have other spikes that can tip things back to being unprofitable.
Where it does make sense in the short-term for this scenario is the experience and knowledge necessary to reliably run your own servers. If you don’t have that, you may not want to invest the time and effort to do so. But on pure cost, unless your bill is on the order of a few thousand per month, cloud will never win. It can’t; they have to make money.
One justifiable excuse is that you simply don't know how much hardware you will need to buy if you're hitting hockey-stick growth. That is, until you realize you can also go hybrid...
None of the colo'd setups I've worked on bar one used purchased servers - it's all been leased. But the majority of non-cloud workloads I've worked on have not even been leased, but rented.
OK says HPC, here's the quote for replacing one of the (currently three) supercomputers with a cloud service. Oh dear, that's bigger than your entire IT budget isn't it? So I guess we do need the DC for housing the supercomputers.
If we'd done that once I'd feel like well management weren't to know, but it recurs with about a 3-5 year periodicity. The perception seems to be "Cloud exists, therefore it must be cheaper, because if it wasn't cheaper why would it exist?" which reminds me of how people persuade themselves the $50 "genuine Apple" part must be better because if it wasn't better than this $15 part why would Apple charge $50 for it? Because you are a sucker is why.
AWS, however, remained 2x-3x as expensive, with the devops time factored in.
The perception seems to be "Cloud exists, therefore it must be cheaper, because if it wasn't cheaper why would it exist?"
People are also blithely unaware that large customers get significant discounts, so I regularly have to explain that BigCo X being hosted in AWS means, at most, that it is cost-effective for them because their spend gets them a significant discount over even the highest-volume published pricing - and my clients usually are nowhere close to spending enough to get those discounts.
If <service> is popular, it must also be cheap, beautiful, well documented, have every feature that exists and make you popular with your friends.
I once had a Product Manager try to start an argument with me: "Explain to me how it is possible that the service we pay 25k a month for doesn't have <feature>. You don't know what you are saying." It just didn't do what he wanted, and getting angry with them over the phone didn't magically make the feature appear.
If we'd done that once I'd feel like well management weren't to know, but it recurs with about a 3-5 year periodicity.
So basically every time management changes[1]?
[1] https://maexecsearch.com/average-c-suite-tenure-and-other-im...
† Name changed to protect individuals but also because frankly I don't care very much who is currently doing these roles, there'll be others.
A lot of people have convinced themselves that cloud is cheap
I've noticed this too, freelancing/consulting around in companies. I'm not sure where this idea even comes from, because when cloud first started making the news, the reasoning went something like "We're OK paying more since it's flexible, so we can scale up/down quickly", and that made sense. But somehow today a bunch of people (even engineers) are under the belief that cloud somehow is cheaper than the alternatives. That never made sense to me, even when you take into account hiring people specifically for running the infrastructure, unless you're a one-person team or have to aggressively scale up/down during a normal day.
A fair number of our clients routinely spin up workloads that are CPU bound on hundreds-to-thousands of nodes. These workloads can be EXTREMELY spiky, with a baseload for routine background jobs needing maybe 3-4 worker nodes, but with peak uses generating demand for something like 2k nodes, saturating all cores.
These peak uses also tend to be relatively time sensitive, to the point where having to wait two extra minutes for a result has real business impact. So our systems spin up capacity as needed and, once the load subsides, terminate unused nodes. After all, new ones can be brought up at will. When the peak loads are high (and short) enough, and the baseload low enough, the elastic nature of cloud systems has merit.
I would note that these are the types of clients who will happily absorb the cross-zone networking costs to ensure they have highly available, cross-zone failover scenarios covered. (Eg. have you ever done the math on just how much a busy cross-zone Kafka cluster generates in zonal egress costs?) They will still crunch the numbers to ensure that their transient workload pools have sufficient minimum capacity to service small calculations without pre-warm delay, while only running at high(er) capacity when actually needed.
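Since the Kafka question comes up a lot, here's the kind of napkin math involved - a hypothetical Python sketch where the $/GB rate, ingest rate, and consumer fan-out are all assumptions (and rack-aware "fetch from closest replica" consumers can shrink the consumer leg considerably):

```python
# Napkin math for cross-AZ ("zonal") traffic generated by a Kafka cluster.
# Every number here is an assumption for illustration, not a measured figure.
cross_az_per_gb = 0.02       # assumed combined in+out cross-AZ transfer rate
ingest_mb_per_sec = 100      # sustained produce rate into the cluster
replication_factor = 3       # replicas spread across 3 AZs
consumer_fanout = 2          # consumer groups typically reading from other AZs

seconds_per_month = 30 * 24 * 3600
ingest_gb = ingest_mb_per_sec * seconds_per_month / 1000

# Each produced byte is copied to (RF - 1) replicas in other AZs, then read
# again by consumers that often sit in yet another AZ.
cross_az_gb = ingest_gb * ((replication_factor - 1) + consumer_fanout)
print(f"~{cross_az_gb:,.0f} GB/month cross-AZ "
      f"-> ~${cross_az_gb * cross_az_per_gb:,.0f}/month")
```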
Optimising for availability of live CPU seconds can be a ... fascinating problem space.
Most people aren't in those situations, though, but I think a lot of them think they're much closer to your scenario than the much more boring situation they're actually in.
But devs rarely pay attention to metrics. I've had clients with expensive Datadog setups where it was blatantly obvious that nobody had ever dug into the performance data, because if they did they'd have noticed that key metrics were simply not fed to it.
If they did pay attention, most of them would realise that their autoscaling rarely kicks in all that much, if at all. Often because it's poorly tuned, but also because most businesses' daily cycles are small enough that it barely matters.
Factor in that the cost difference between instances and managed servers is quite significant, and you need spikes much shorter in duration than most businesses' day/night variation to save money.
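As a toy illustration of that break-even (both prices are made-up placeholders, not quotes):

```python
# Toy break-even: a flat-rate managed/dedicated server vs. an on-demand cloud
# instance of roughly comparable capacity that you only run when needed.
managed_flat_rate = 150.0    # assumed $/month for a managed server, always on
instance_per_hour = 0.60     # assumed $/hour for a comparable on-demand instance

break_even_hours = managed_flat_rate / instance_per_hour
print(f"On-demand only wins below ~{break_even_hours:.0f} instance-hours/month "
      f"(~{break_even_hours / 30:.1f} hours/day of actual need)")
```

With numbers like these, the instance has to sit idle most of the day before paying by the hour beats the flat rate.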
It can make sense to be able to spin up more capacity quickly, but then people need to consider that 1) a lot of managed hosting providers have hardware standing by and can automatically provision it for you rapidly too - unless you insist on only using your own purchased servers in a colo, you can get additional capacity quickly, 2) a lot of managed hosting providers also have cloud instances so you can mix and match, 3) worst case you can spin up cloud instances elsewhere and tie them into your network via a VPN.
Some offer the full range from colo via managed servers to cloud instances in the same datacentres.
Once you prep for a hybrid setup, incidentally, cloud becomes even less competitive, because suddenly you can risk pushing the load factor on your own/managed servers much closer to the wire, knowing you can spin up cloud instances as a fallback. As a result, the cost per request for managed servers drops significantly.
I also blame a lot of this on the business side often shielding engineering from seeing budgets and costs. I've been in quite senior positions in a number of companies where the CEO or CFO was flabbergasted when I asked for basic costings for staff and infra, because I saw it as essential in planning out architecture. Engineers who aren't used to seeing cost as part of their domain will never have a good picture of costs.
Internal company accounting can be weird and lead to unintuitive local optima. At companies I've worked at, what was objectively true was that cloud was often much cheaper than what the IT department would internally bill our department/project for the equivalent service.
I have even had it suggested that it might make selling a business or attracting investors harder if you used your own servers (not at the scale of having your own datacentre, just rented servers - smaller businesses still).
Another thing that comes up is that it might be more expensive, but it's a small fraction of operational expenses, so no one really cares.
Maybe it's an understanding that doesn't change because the decision makers were non-technical people (when finance oversees IT despite not understanding it).
Virtualizing and then sharing a dedicated server as a VPS was a big step forward.
Only, hardware kept getting cheaper and faster, as well as internet.
when finance oversees IT despite not understanding it
... and when IT often do not even get to see the spend, and/or isn't expected to.
I've had clients where only finance had permissions to get at the billing reports, and engineering only ever saw the billing data when finance were sufficiently shocked by a bill to ask them to dig into it - at which point they cared for long enough to get finance off their backs, and then stopped caring again.
That was in the late 00s and early 10s. PHP, Python, Ruby and even Java were slow. Every single language and framework has had massive performance improvements in the past 15 to 20 years, anywhere from 2x for Java to 3-10x for Ruby.
Back then a server maxed out at 6-8 Xeon cores, compared to 192 cores today. Every core is at least 2-3x faster per clock, and with higher clock speeds we are talking about a 100x difference. I/O used to be on HDDs, and SSDs are easily 1000x faster, so what used to be time spent waiting on I/O is no longer an issue. Add it all together, including software, and the aggregate difference could be 300x to 500x.
What would have needed 500 2U servers in 2010 can now be done in one.
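Compounding the factors from the comment above (every multiplier is an assumption pulled from that comment, not a benchmark):

```python
# Rough compounding of the hardware/software gains described above.
core_count_gain = 192 / 8    # ~6-8 Xeon cores then vs ~192 cores now
per_core_gain = 2.5          # assumed IPC + clock improvement per core
software_gain = 3            # assumed language/runtime speedups since ~2010

combined = core_count_gain * per_core_gain * software_gain
print(f"~{combined:.0f}x on CPU-bound work, before counting HDD -> SSD I/O gains")
```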
Modern web developers are so abstracted away from hardware that I don't think many realise the scale of these improvements. I remember someone posted that before 2016 Basecamp had dozens of racks before moving to the cloud. Now they have grown a lot bigger with Hey, and they are only doing it with 8 racks and room to spare.
AWS, on the other hand, is trying to move more workload to ARM Graviton, where they have a cost advantage. Given that Amazon's stock price is now dependent on AWS, I don't think they will lower their prices by much in the future. And we desperately need some competition in that area.
And if you're not doing that, you are hiring a bare-metal server provider that is still taking a portion of the money you'd be paying AWS.
Even if you don’t need to physically visit data centers thanks to your server management tools, the difference in the level of control you have between cloud and bare metal servers is large. You’re paying to enable workflows that have better automation and virtual networking capabilities.
I recently stood up an entire infrastructure in multiple global locations at once and the only reason I was able to do it in days instead of weeks or months was because of the APIs that Amazon provides that I can leverage with infrastructure automation tooling.
Once you are buying AWS reservations and avoiding their most expensive specialized managed products the price difference isn’t really worth trying to recover for many types of businesses. It’s probably worth it for Hey since they are providing a basic email service to consumers who aren’t paying a whole lot. But they still need something that’s “set it and forget it” which is why they are buying a storage solution that already comes with an S3 compatible API. So then I have to ask why they don’t save even more money and just buy Supermicro servers and install their own software? We all know why: because Amazon’s APIs are where the value is.
There is a lot of profit margin in software and usually your business is best spending their effort working on their core product rather than keeping the lights on, even for large companies. Plus, large companies get the largest discounts from cloud providers which makes data centers even less appealing.
“Convenience” isn’t just convenience, it’s also the flexibility to tear it all down and instantly stop spend. If I launch a product and it fails I just turn it off and it’s gone. Not so if I have my own data center and now I’ve got excess capacity.
How many are actually multi-region? How many actually do massive up/down-scaling on short notice? How many actually use many of those dozens to hundreds of services? How many actually use those complex permissions?
My experience tells me there are some, but there are more who treat AWS/GCP/Azure like a VPS hoster that's 5-10x more expensive than other hosters. They are not multi-region, they don't do scaling, they go down entirely whenever the AZ has some issues, etc. The most they do is maybe use RDS instead of installing mysql/pgsql themselves.
This idea that their basic users go down entirely when the AZ has some issues is ridiculous, a standard autoscaling group and load balancer basically forces you to be multi-AZ. Very much unlike a VPS.
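For anyone who hasn't seen it, this is roughly what that looks like - a minimal boto3 sketch, where every ID and ARN is a placeholder and the sizes are arbitrary:

```python
import boto3

# Minimal sketch: an auto scaling group spread across subnets in two different
# AZs, attached to a load balancer target group. All IDs/ARNs are placeholders.
autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0", "Version": "$Latest"},
    MinSize=2,
    MaxSize=6,
    # Subnets in two different availability zones: this is what makes the group
    # multi-AZ, so losing one AZ doesn't take the whole service down.
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    TargetGroupARNs=["arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/web/abc123"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=120,
)
```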
Using RDS instead of self-installing SQL eliminates the need for an entire full time role for DB admin. So that’s kind of a big deal despite it being a “basic” use case.
A lot of services like ECS and Elastic Beanstalk can make it so that you can wait longer to hire operations people, and when you do, they can migrate to more scalable solutions without having to do a major migration to some other provider or build a custom self-hosted solution. If you outgrow a VPS, you have to do a major migration.
And if you take a look at the maturity and usefulness of the Terraform providers, SDKs, and similar integrations for VPS and bare-metal providers, they are very basic compared to Boto and the AWS Terraform provider.
I struggle to replicate the level of automation I can achieve with these cloud tools on my own homelab with Proxmox.
Using RDS instead of self-installing SQL eliminates the need for an entire full time role for DB admin.
No it doesn't. The value in a skilled DB admin is not in keeping the DB up and running, because no special skills are required to do that; the DB admin is an expert in performance. They add considerable value in ensuring you get the most bang for your buck from your infrastructure.
A popular modern alternative to this of course is to throw more money at RDS until your performance problems go away.
While you’re not wrong about DB admins being important for performance optimizations, RDS stops you from having an inexperienced administrator lose data in stupid ways.
I know because I used to be that stupid person. You don’t want to trust your company’s data to a generalist that you told to spin up a database they’ve never configured before (me) and hope they got good answers when they googled how to set up backups/snapshots/replication.
I can look at any company’s RDBMS who doesn’t have a full-time DB[A,RE] on staff and find ten things wrong very quickly. Duplicate indices, useless indices, suboptimal column types, bad or completely absent tuning, poor query performance…
It’s only when a company hits the top end of vertical scaling do they think, “maybe we should hire someone,” and the problem then is that some changes are extremely painful at that scale, and they don’t want to hear it.
IaaS (Proxmox) is a different layer than PaaS as we know.
The same orchestration tools (Terraform) can orchestrate Proxmox or other hypervisors just fine. Discounted licenses for VMware are readily available on ebay if that is preferred.
Proxmox has built-in node mirroring between multiple servers, it just works after it's connected.
For instance, I was surprised how large the market was for Amazon Connect - Amazon’s hosted call centers. It’s one of the Amazon services I have some experience in and I still get recruiters contacting me for those jobs even though I don’t really emphasize that specialty.
My experience is from 7 years of working with AWS. First at a startup with a lot of complex ETL and used a lot of services. But the spend wasn’t that great.
My next 5 years were split between working at AWS (Professional Services) and two years at a third-party consulting company (full time), mostly as an implementation lead.
Even though my specialty is "cloud native application development" and I avoid migrations like the plague, most of the money in cloud consulting is in large companies deciding to move to the cloud because they decided that the redundancy, lower maintenance overhead, and other higher-level services were worth it.
It's trivial to get equipment into a datacenter, where the equipment is visited on your behalf if you wish.
You can place your own equipment in a datacenter to manage yourself (dedicated servers).
You can have varying amounts of the hardware up to the software layer managed for you as a managed server, where others on site will do certain tasks.
Both of these can still be cheaper than cloud (which mostly provides convenience, at a large markup, by making often open-source tools easy to administer from a web browser), even before you factor in paying someone to manage the cloud.
Standing up global locations at once can still be done with a hybrid-cloud or cloud-agnostic setup (a requirement you may have anyway for fallback and independence, so you're not tied to one cloud only).
It's the time in between that's the most awkward. When the potential savings are there that hiring an engineering team to internalize infrastructure will give a good return (were current pricing to stay), but you're not so big that just threatening to leave will cause the provider to offer you low margin pricing.
All I'd say is don't assume you're getting the best price you can get. Engineers are often terrible negotiators, we'd rather spend months solving a problem than have an awkward conversation. Before you commit to leaving, take that leverage into a conversation with your cloud sales rep.
Also, one trick to inflate your commitment expenses is asking your SaaS providers if it's possible to pay them through AWS or GCP marketplaces: it often counts against your commitment minimum expense, so not everything has to be instances and storage.
From this, you get a wide range of discounts on a bunch of products, not just instances. And I think those discounts go on top of some of the other discounts you regularly have, but I'm not sure and I'd have to check our billing.
So in this case, hundreds of billions. But the principle stands at lower company sizes, just with different numbers and amounts of leverage.
Cloud providers are looking for multi-year terms and a commitment to growth as much as, or more than, the exact spend level right now.
Even going multi-region, hiring dedicated 24/7 data center staff, and purchasing your own hardware amortizes out pretty quickly and can give you a serious competitive advantage in pricing against others. This is especially true if you are a large consumer of bandwidth.
Engineers are often terrible negotiators, we'd rather spend months solving a problem than have an awkward conversation.
My experience is the opposite: lots of software developers ("engineers") would love to do "brutal" negotiations to fight against the "choking" done by the cloud vendors.
The reason why you commonly don't let software developers do these negotiations is thus the complete opposite: they apply (for the mentioned reasons) an ultra-hardball negotiation style (lacking all the diplomatic and business customs of politeness) that leads to vast lands of burnt soil. Thus, many (company) customers of the cloud providers fear that this hardball negotiation style destroys any future business relationship with the respective (and perhaps for reputation reasons a lot of other) cloud service provider(s).
It's almost always less expensive at the start, which is super important for the early stages of a company (your capital costs are basically zero when choosing say AWS).
Then after you're established, it's still cheaper when considering opportunity costs (minor improvements in margin aren't usually the thing that will 10x a company's value, and adding headcount has a real cost).
But then your uniqueness as a company will come into play and there will be some outsized expense that seems obscene for the value you get. For the article writer, it was S3, for the OP, it's bandwidth. For me it's lambdas (and bizarrely, cloud watch alarms). That's when you need to have a hard look and negotiate. Sometimes the standard pricing model really doesn't consider how you're using a certain service, after all it's configured to optimize revenue in the general case. That doesn't mean the provider isn't going to be willing to take a much lower margin on that service if you explain why the pricing model is an issue for you.
The reality is when you get to another certain point (larger than the point you describe) you start negotiating directly with those cloud providers and bypass their standard pricing models entirely.
And even if you do, you still end up with pretty horrible pricing, still paying per GB of "premium" traffic for some outrageously stupid reason, instead of going the route of unmetered connections and actually planning your infrastructure.
In terms of durability that's a universe apart.
I've seen a few of these deals with other vendors up close, the difference with public pricing is huge if you spend millions per year.
I remember trying to convince some very talented but newly minted ops professionals -- my colleagues -- to go on prem for cost. This was last year. They were scared. They didn't know how that would work or what safety guarantees there would be. They had a point, because the org I was at then didn't have any on prem presence, since they were such a young organization that they started in the cloud during "the good times". They always hired younger engineers for cost, so nearly no one in the org even knew how to do on prem infra. Switching then would have been a mistake for that org, even though cloud costs (even with large commit agreements) were north of six figures a month.
The Math stops working when you grow to a certain point.
That point is different for every business. For most of them it depends on how big cloud is in your COGS (cost of goods sold) which affects gross margins, which in turn is one of the most meaningful measures of company financial health. Depending on the nature of your business and the amount of revenue you collect in sales, many companies will never reach the point where there's measurable payback from repatriating. Others may reach that point, but it's a lower priority than other things like opening up new markets.
Many commenters seem to hold very doctrinaire opinions on this topic, when it's mostly basic P&L math.
$1.5 million/year
That is excessive and it's already $4K a day.
Lots of teams really underestimate cloud costs since there is an assumption that the hundreds of millions they are raising will give them enough runway to survive a few years despite losing money for years.
Even scaling would be somewhat of an issue depending on the tech stack. Imagine the cost of running standard Java micro-services and the "solution" was to "spin up hundreds of more nodes". The worst that I have seen was a bank proudly having up to 8,000 - 10,000 separate micro-services.
Just imagine the daily cost of that. Unjustifiable.
But of course the AWS cloud consultants would be happy to shill you their offerings at "cheap" prices, but in reality the pricing is designed for you to accumulate millions in costs as you scale on the tiniest amount of usage, even for testing.
So before you build the software, you must think about the cost of scaling if it becomes widely used, rather than taking the easy approach of just spinning up more nodes and piling up costs, acting as if you have the capital to throw at the problem. You can only do that for so long until you don't.
There was nowhere near the same volume of data as Basecamp/Hey, nor was much processing power needed. It was purely bad engineering accumulated over 10 years.
After some time I was quite familiar with their stack and had gathered considerable domain experience. This led to an idea how to halve the database load (and the cost would presumably fall by a similar percentage), which I wanted to use as leverage during contract renegotiation.
I boldly offered to work for free to halve their database load, in exchange for being paid half the money this optimization would save over the course of one year. This would basically triple my pay, and they would still save money.
They declined, and I moved to a better opportunity.
Last I heard they had to pay a team of 4 new consultants for a year to implement the same idea I had. Without the domain knowledge, the consultants couldn't progress as fast as I suspect I could have (my estimate was 2 months of work).
I know it's very petty, but I regret revealing too many implementation details of the idea during the pitch and allowing the company to contract other consultants to see it done.
The challenge is that people don't believe you when you tell them they can save that much, no matter how much evidence you prepare. I'm starting a sales effort for my agency right now, and one of the things we've worked on is promising less than what we determine we can deliver after reviewing a client's costs, and raising our prices, because it's ironically easier to close on the basis of a promise to deliver 20%-30% savings at a relatively high cost than a promise to deliver 50%+ with little effort.
I think part of the expectation when contracting somewhere long-term (or just being an employee, for that matter) is that the amount of value you add per hour worked increases sharply over time, relative to your fee. In other words, initially you're overpaid wrt your value-add, and then that corrects itself over time as you figure out what the company is all about.
No go. “It’s hard to hire for that skill set.” Is it $9 million/year hard?! You already have a team lead – me. This shit is not that hard; people will figure it out, I promise.
Would be alarming if it were 500% of the staff salary, but at 50% that just seems like the cost of outsourcing to a standard that likely won't be achieved in-house.
The two major problems were:
1. The volume of data itself was not that big (I had a backup on my laptop for reproductions), but it was just too heavy for even the biggest things in AWS. Downtimes were very frequent. This is mostly due to decisions from 10 years ago.
2. Teams constantly busy putting out fires but still getting only 1-2% salary increases due to lack of new features.
EDIT: Since people like those war stories. The major cause for the performance issues was that each request from an internal user would sometimes trigger hundreds of queries to the database. Or worse: some GET requests would also perform gigantic writes to the Double-Entry Accounting system. It was very risky and very slow.
This was mostly due to over-reliance on abstractions that were too deep. Nobody knew which joins to make in the DB, or was too afraid to, so they would instead call 5 or 6 classes and join manually, causing O(N^2) issues.
To give a sense of how stupid it was: one specific optimization I worked on changed the rendering time of a certain table from 25 seconds to 2 milliseconds. It was nothing magical.
I'm glad I left.
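For readers who haven't seen this failure mode, here's a contrived sqlite3 sketch of the pattern described above (tables and data are made up) - one query per row "joined" in application code versus one join done by the database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE entries  (id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL);
""")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [(i, f"acct-{i}") for i in range(1000)])
con.executemany("INSERT INTO entries (account_id, amount) VALUES (?, ?)",
                [(i % 1000, 1.0) for i in range(20000)])

# Slow path: one query per account, "joined" in application code (N+1 pattern).
totals = {}
for acct_id, name in con.execute("SELECT id, name FROM accounts"):
    rows = con.execute("SELECT amount FROM entries WHERE account_id = ?",
                       (acct_id,)).fetchall()
    totals[name] = sum(amount for (amount,) in rows)

# Fast path: one query, joined and aggregated inside the database.
totals_db = dict(con.execute("""
    SELECT a.name, COALESCE(SUM(e.amount), 0)
    FROM accounts a LEFT JOIN entries e ON e.account_id = a.id
    GROUP BY a.id
"""))
assert totals == totals_db   # same result, wildly different query counts
```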
On an off note, migrating to NoSQL might not have a lot of on-paper benefit, but it does force developers to design their tables and queries in a way that prevents this kind of query hell. Which might be worth it on its own.
Relational DBs are great, but just like Java design patterns, they get abused because they can be. People are happy doing stuff like that because it's low resistance and low effort, with consequences building up in the long term.
Database joins were fine, they just weren’t being made in the database itself, due to absurd amounts of abstraction.
I don't disagree that rethinking the problem with NoSQL would solve it (or maybe even would have prevented it), but on the other hand I bet having 5 layers of OOP could also mess up a perfect NoSQL design.
And even if you don't want the hassle of storing the data yourself, there are many far cheaper outsourced options than S3.
Even scaling would be somewhat of an issue depending on the tech stack. Imagine the cost of running standard Java micro-services and the "solution" was to "spin up hundreds of more nodes". The worst that I have seen was a bank proudly having up to 8,000 - 10,000 separate micro-services. Just imagine the daily cost of that.
I'm not going to preach for thousands of micro-services necessarily, but they also make scaling easier and cheaper.
Not every service in your application receives the same load, and being able to scale up only the 20% of Lambdas that receive 80% of the traffic will result in massive savings too.
the hundreds of millions they are raising will give them enough runway to survive a few years despite losing money for years.
It's more that the decision makers at every stage are not incentivized to care, or at least, were not during the ZIRP period. This is slowly changing, as evidenced by more and more talks of "cloud exits".
Software engineers are encouraged by the job market to fill their resume with buzzwords and overengineer their solutions.
Engineering managers are encouraged by the job market to increase their headcount, so complicated solutions requiring lots of engineers actually play in their favor.
CTOs are encouraged by the job and VC funding market to make it look like their company is doing groundbreaking things and solving complex problems, so overengineering again plays in their favor. The fact these problems are self-inflicted doesn't matter, because everyone is playing the same game and has no reason to call them out for it.
Cloud providers reward companies/CTOs for behaving that way by extending invites to their conferences, which gives the people involved networking opportunities and "free" exposure for the company to hire more engineers to fuel the dumpster fire even more.
You don't get any testing services baked into the pricing; you're paying production pricing for setting up / tearing down environments for testing. They have little to nothing in the way of running emulators locally for services, and it leads to other solutions of varying quality.
It’s outrageous and something i will always hold against AWS forever. Not to mention their CDK is for shit. Their APIs are terrible and poorly documented. I don’t know why anyone chooses them still other than they seem to have the “nobody got fired for choosing AWS” effect.
Azure is really good at providing emulators for lots of their core services for local testing for instance. Firebase is too, though I can’t vouch for the wider GCP ecosystem
It was always just above three years to break even. After we decided to self-host I still kept tracking the prices in the spreadsheet and as hardware costs fluctuated Amazon adjusted their prices to match. I figured someone at Amazon must have had the same spreadsheet I was working with.
Hardware cycle is probably about 3 years also.
AWS outbound bandwidth costs in particular are tens of times higher than what you can get elsewhere, to the point that when clients insist on S3, e.g. because they're worried about durability (which is a valid consideration), I usually ask if they'd be happy to put a hefty cache at a cheaper provider in front - if you use lots of bandwidth, it's not unusual for it to be cost effective to cache 100% of the dataset somewhere cheaper just to avoid AWS bandwidth charges.
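A toy version of that "cache the whole dataset" calculation, with made-up sizes and prices (and ignoring the one-off S3 fill traffic):

```python
# Toy break-even for fronting S3 with a full copy at a cheaper provider.
# Every figure is an illustrative assumption, not a quoted price.
dataset_gb = 20_000            # assumed 20 TB dataset
monthly_egress_gb = 100_000    # assumed 100 TB/month served to users

s3_egress_per_gb = 0.09        # assumed AWS internet egress rate
cache_storage_per_gb = 0.01    # assumed storage price at the cheaper provider
cache_egress_per_gb = 0.01     # assumed egress price at the cheaper provider

direct_from_s3 = monthly_egress_gb * s3_egress_per_gb
via_full_cache = (dataset_gb * cache_storage_per_gb
                  + monthly_egress_gb * cache_egress_per_gb)
print(f"Direct from S3: ${direct_from_s3:,.0f}/month")
print(f"Via full cache: ${via_full_cache:,.0f}/month")
```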
Hardware like cars and laptops can continue to perform after they are written off, or even after the warranty.
The grade of hardware used is critical in servers.
Hyperscaling might mean commodity based servers. Hosting a large app does not mean using commodity component servers.
Hardware, when self hosting, does not need to be replaced every 3-5 years because it does not fail every 3-5 years. Depends on load and a bunch of factors.
Why?
We wouldn’t buy the cheap and disposable components a massive cloud or social media network might use to scale faster because they have a massive budget.
Besides, do providers really replace all their servers every 3-5 years? Hosting companies don’t seem to.
The cloud is many multiples more expensive than self hosting especially at scale. Hosting and cloud tools have brought down labour costs tremendously.
For the hardware: with the extremely clean environments servers run in, plus much cleaner electricity, hardware runs much longer.
Purchase actual enterprise-grade servers (HP ProLiant, etc.) that a company would buy for themselves for maximum reliability (compared to the commodity-based ones of the clouds), and those have so much reliability built into them that they sometimes never die.
You can still buy used ProLiant servers many, many, many generations old and they hum along just fine. It seems bizarre, but it isn't.
Support is a few things: warranty on parts and software. Extended support options (which amount to a hardware warranty) are always available for a fee, and achievable on your own.
If your software is a hypervisor you will be mirrored.
If a server has an issue the affected machine moves the load elsewhere.
The server has hot swap equipment. Takes a few moments to swap components if needed.
If you are self hosting, you can buy a used server or two or three to have a backup, a mirror, and spare parts. It's like buying a few NUCs.
Hosting corporately can be done not just by buying but by leasing too (meaning hardware swapping can happen). Add to this moving older equipment to less demanding tasks (if it ever does stay at load).
Write-off is an accounting term, not an operational one.
That's the point. I've just decommissioned 10-year-old servers for a client. They were still working fine, but the system had finally been replaced.
If you're calculating break-even based on the rate at which you're writing off the accounting value of the servers, you'll end up with a far longer time to break-even than if you amortise the hardware cost over the projected actual lifetime of the hardware.
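A toy example of how much the amortisation window alone moves the comparison (all numbers are placeholders):

```python
# Spreading the same capex over the accounting write-off period vs. a realistic
# service life changes the apparent monthly cost of self-hosting.
hardware_capex = 120_000.0
monthly_colo_and_ops = 3_000.0
equivalent_cloud_bill = 9_000.0   # assumed monthly cloud cost for the same workload

for label, months in (("3-year write-off", 36), ("7-year actual life", 84)):
    self_hosted = hardware_capex / months + monthly_colo_and_ops
    print(f"{label}: ${self_hosted:,.0f}/month self-hosted "
          f"vs ${equivalent_cloud_bill:,.0f}/month cloud")
```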
Depending on the jurisdiction (this seems to be common), equipment can be written off on a faster depreciation schedule than the period it's actually kept or used for.
In that way, writing off is often one part maximizing the depreciation schedule (to "write it off" as an asset in your business as quickly as possible), and another part how long the equipment is actually kept.
Insert stereotype of bean-counting and propellers-spinning.
This means it's perfectly possible to use equipment after it's been written off, and be in a position to re-purchase it when it fails.
Spreadsheets can be a disease this way, written for the single scenario they're evaluating rather than for enough scenarios or forecasts.
We should assume multi-billion dollar clouds do not use single spreadsheets to understand how they make 5-10x (or higher) off the same server resources by selling them as individual API calls.
The markup on cloud services can be astronomically high. Having been around data-centre hosting of your own bare-metal servers, then virtualized servers, then going to the cloud 1000%, I've now realized it's gotten much easier to self-host, personally and professionally (with experience).
The assumptions about why one might use a cloud originate with the early days of the cloud and remain anchored there, regardless of everything that has changed and evolved since.
A lot of the other markup is obscured by splitting things into multiple categories, such as costs per requests, separate pricing for bandwidth etc. A lot of clients I talk to don't understand the pricing of the services they run, and the developers usually both don't know and don't care.
I have 13 year old Dell R620s that have been running 24/7/365 for a few years in a suboptimal environment at this point (I mean, minus occasional restarts for kernel updates, brief maintenance periods, etc.). The only thing I’ve had to replace were RAM and a single PSU.
1. Initial costs of $2.5M to provision the hardware (disks, servers, enclosures, networking equipment, redundancy, software solutions etc.)
2. Facility OPEX: $50,000/month (power, connectivity, monitoring, etc.)
3. Staffing and Operation tools: $10,000 / month
4. Replacement cycle of 5 years so assume $500,000/year ~ $41K/month
2. Facility OPEX: $50,000/month (power, connectivity, monitoring, etc.)
Are you sure? It looks like they only have 1 rack (+1 in another facility for redundancy) and seem to have 40Gbit/s connectivity.
A full rack is in the range of $1k, and connectivity around $600 per 10 Gbit/s. I have no idea how much power they consume, but I doubt it's $40k+ per month for a storage workload. I would guess they are in the $10k range. Those are only list prices I've seen in the wild, so take it with a grain of salt, but $50k seems VERY high.
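Using the list prices above plus an assumed metered-power rate (the per-kW figure is my guess, and local pricing varies a lot), a quick sketch lands well under $50k:

```python
# Rough one-rack-per-site facility estimate; every figure is an assumption.
racks = 2                  # one rack per data centre, per the thread above
rack_rent = 1_000          # $/month per rack (space + base power)
extra_power_kw = 8         # assumed additional metered power per rack
power_per_kw = 200         # $/month per kW, an assumed colo ballpark
links_10g = 4              # 40 Gbit/s total connectivity
per_10g_link = 600         # $/month per 10 Gbit/s, per the comment above

monthly = (racks * rack_rent
           + racks * extra_power_kw * power_per_kw
           + links_10g * per_10g_link)
print(f"~${monthly:,.0f}/month facility opex")   # single-digit thousands, not $50k
```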
how many SSDs you need
power usage for all those SSDs
inter-site connectivity (you need to keep transferring data between the sites, otherwise customers are going to be very surprised)
maintenance and software costs (at the colo level)
All these add up, closer to $50K than some $2K (LOL). The way you guys (below) are talking, this is not some home server that serves personal videos. It runs (some) business operations for thousands of small/medium companies.
For $50,000/month in the USA, you could likely secure:
Multiple full cabinets (approximately 10-12 cabinets)
Higher power allocation (50-60kW total)
Extensive bandwidth packages with multiple high-capacity connections
This seems excessive both on space (10 full racks) and on power (50kW). I'm open to more details from you on why 50k is reasonable.
The facility opex of 50k a month is just wrong for one rack.
Whatever LLM you used, it told you good sounding bullshit. Like LLMs do.
That doesn't mean they're wrong to move, it means you need to be careful to make sure that you pay for what you need, and try to avoid paying extra if all it gives you is stuff you don't want. I value the extra functionality, so I'm not moving my data off S3.
Somewhere along the line, people started defaulting to "at least 3 nodes for the backend" and "cloud services for all infrastructure" even if the product they're building hasn't even found product-market fit.
Sure, if you know for a fact that your traffic will go up and down more than 50% during a normal day, go for something that scales up and down quickly. But for most other use cases, the extra cost of cloud doesn't really make much financial sense, unless you're a fat VC-funded startup cat.
It was a Node.js app he deployed via SSH and ran under a systemd job.
Used directories of JSON files as a database and the business logic was handled by a single endpoint that took JSON RPC payloads with different action types and metadata.
The app scaled to ~10,000 daily users like this.
Meanwhile, I see friends working on MVPs with 1-2 non-paying customers who already have costs in the thousands of dollars a month, but "it's fine because we got free money for a year". Yes, but that means that your company now has an expiration date of a year.
My personal AWS bill is roughly $10/month, all for S3. We're not talking millions here :). Personal compute is a mix of OVH and on-prem.
Work is an entirely different kettle of fish, at an entirely different scale, and primarily runs compute on spot: https://aws.amazon.com/blogs/aws/capacity-optimized-spot-ins...
Being able to scale down, rather than needing to pay for peak capacity, genuinely does save us large amounts of money. But it's a capability that we needed to build out, not something that happened by magic. And it does require that our services are big enough to scale for load, not just for redundancy.
I'd love to move to coloc for my SaaS but it doesn't feel as resilient. Please correct me if I'm wrong as I'd love to move off the cloud.
You could use an orchestration solution to help handle automatic failover. There's a handful of container-based options from heavy duty Kubernetes to Docker Swarm and Nomad.
Containers are nice since you can bypass most of the host management where you only need basic security patching and installation of your container runtime. There's also k8s distros like OpenShift to make k8s setup easier if you go that route.
I believe paid PG vendors like EnterpriseDB and maybe Crunchy have their own tools
We just replaced our top of rack firewall/proxies that were 11 years old and working just fine. We did it for power and reliability concerns, not because there was a problem. App servers get upgraded more often, but that's because of density and performance improvements.
What does cause a service blip fairly regularly is a single upstream ISP. I will have a second ISP into our rack shortly, which means that whole class of short outage will go away. It's really the only weak spot we've observed. That being said, we are in a nice datacenter that is a critical hub in the pacific northwest. I'm sure a budget datacenter will have a different class of reliability problems that I am not familiar with.
But again, an occasional 15m outage is really not a big deal business-wise. Unless you are running a banking service or something, no one cares when something happens for 15m. Heck, all my banks regularly have "maintenance" outages that are unpredictable. I promise, no one really cares about five nines of reliability in the strong majority of services.
I used to joke that my homelab almost had better reliability than any company I’d been at, save for my ISP’s spotty availability. Now that I have a failover WAN, it literally is more reliable. In the five years of running a rack, I’ve had precisely one catastrophic hardware failure (mobo died on a Supermicro). Even then, I had a standby node, so it was more of an annoyance (the standby ran hotter and louder) than anything.
And when you get down to it, AWS isn't actually that reliable. I thought EBS volumes had magic redundancy foo, but it turns out they can fail, and they fail in a less obvious way than a regular disk. AWS networking is constantly bouncing and the virtual network adapters just sometimes stop working. They're also running old CPUs.
Depending on your workload you may be able pay off your new hardware with just a couple months' savings.
I wonder if the savings includes the cost of labor to maintain the physical servers, cabling,
This appears to be a valid point, but it really isn't.
In one case you're paying sysadmins, in the other you're paying cloud engineers.
performance and security monitoring, etc. Not saying it doesn't, I just wonder.
You'd be paying those anyway. AWS's mantra is that Amazon takes care OF the security of the cloud but YOU (the client company) take care of the security IN the cloud. Same goes for performance etc.
Most of the time when people bring up these costs they have no experience with modern server hardware, and hosting. You can have the server shipped directly to a colo, you can file tickets with the provider to have it plugged into your power bar and your network, and you can connect with your IPMI client and set it to boot from your PXE/bootp/tftp server to get an install image.
With a well-configured setup, you have a one-off cost to set up your firewalls, wire-up the switches and power bars, and set up a server for the rest to network boot off, and the rest of bringing up your servers is near automatic and most management can be done remotely via IPMI or similar.
It's not the 90's any more.
(this is a question disguised as a statement, since I'm interested in your opinion)
Poorly run old managed setups also probably burned a lot of more experienced people. E.g. if you had to fill in stupid forms and request a server weeks ahead, odds are cloud equals freedom in your mind, even though a well-run infra setup could offer you the ability to spin up container workloads and leave adding capacity as a background concern developers don't need to think about.
Personally, I recall well the time I had to call in to Yahoo's hardware review board, chaired by one of the founders (Filo) because the billing system I managed, which handled millions of dollars worth of transactions, needed a new database server - priced around $10k. There were at least a dozen senior people on that call.
If that was your experience of colo/managed servers/on-prem, it's not surprising if you value cloud services far above their cost for the sake of avoiding that bureaucracy.
I'm working on tooling that I hope will change that, by making compute and storage on cheaper managed providers vs. cloud providers fungible commodities, but it's a hard problem.
I don't get the impression they save any time - it's just different stuff. They're constantly doing things with permissions.
Permissions in the cloud are a whole other beast, especially in Azure. You can easily spend a week figuring out various managed identity issues.
As for time saving: I've noticed cloud engineers often build a huge contraption of Terraform and Ansible scripts, then bots and processes around them. That is then where the focus and time go, too.
And with any software, it is never done. It can always be better.
Classic sysops is much faster in a state of "done, no need to touch it now".
I know Backblaze uses HDDs and some RAID-like addressing.
Not sure they're at the scale in the article to get good performance with HDDs, but someone with a storage engineering background can correct me. NVMe seems like a waste, though. I imagine you can still get solid performance with SATA/SAS (250 to 50-80k IOPS is still a pretty huge leap).
There were some ways to make it go a bit faster, found by reading the rclone manual, but otherwise no surprises.
I wasn't sure what the maximum transfer speed could have been, but as one side was still the production system I didn't care to reach the limit anyway. Over 10Gb/s anyway.
The whole business, which used to be 1/5 of the size 10 years ago, was running on 100s of racks. And today it could be less than 10.
That might exclude the new Pure Storage setup, though.
https://world.hey.com/dhh/we-stand-to-save-7m-over-five-year...
I don't think it is 8 racks per DC; it is 4 racks per DC.
We currently spend about $60,000/month on eight dedicated racks between our two data centers
And they are only using 64-core parts. We will have 256-core Zen 6 next year. And it seems that by next year, if they were willing to pay for density, they could have fitted everything inside one rack per DC.
Exciting times.
Edit: Actually if Intel were to push 18A on server it would make performance / density even better.
They are bootstrapped and not hyperscaling, so they don’t need flexibility for rapid and unpredictable growth (VC backed startups do). And they have a strong engineering org, so they don’t benefit as much from buying reliability (large non-tech companies often do).
Rook simplifies the operational overhead of Ceph quite a lot, especially in Kubernetes-native stacks. For teams with large data and HA requirements, it's been a solid on-prem alternative. SLA-backed managed services are also becoming more common, which helps reduce the operational burden even further.
(Yes, I’m aware that Hey’s decision to evacuate the cloud is a fait accompli, but I also can’t help but wonder if there are loads of potential savings that are being left out of the discussion.)
They can start saving thousands of dollars even before the deadline if they are able to start moving as soon as their own infra is up and incrementally move and delete data from S3. If their data consumers can work with that, that is.
If part of the data does not need the full S3 guarantees of durability and availability, they could probably save more by using cheaper tiers while still on S3.
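For a sense of scale, here's an illustrative tier comparison - the $/GB rates are assumptions rather than current list prices, and the colder classes add retrieval fees and minimum storage durations:

```python
# Illustrative monthly storage cost across S3 storage classes for data that
# doesn't need hot access. Rates are assumptions; check current pricing.
data_gb = 500_000   # assumed 500 TB

assumed_rates_per_gb = {
    "S3 Standard":             0.023,
    "S3 Standard-IA":          0.0125,
    "S3 Glacier Flexible":     0.0036,
    "S3 Glacier Deep Archive": 0.00099,
}
for tier, rate in assumed_rates_per_gb.items():
    print(f"{tier:<26} ~${data_gb * rate:>9,.0f}/month")
```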
Cost per gigabyte?
Latency curve on read? Latency curve on write?
Support for concurrency?
Max throughput?
I think the article is somewhat shallow, but S3 is a very deep and flexible system. Of course, someone will just need its basic features, but someone else may be interested in advanced features as well.