It's difficult to answer without a better understanding of your customers' workloads and how they trigger your outages. There are a bunch of valid angles from which to look at this.
If your product consistently buckles under customer workloads that they paid to be able to run, it sounds like you have either an underprovisioning or an overcommitment problem.
If you accept customer workload spikes that you don't have the resources to serve right away, but could process if they were spread out more over time, it sounds like you have an admission control problem.
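To make the admission control angle concrete, here's a minimal sketch of a token bucket, which is one common way to do it (not necessarily the right one for your system). The class name, parameters, and the injectable `clock` are all made up for illustration; the idea is just that you admit work at a sustained rate you can actually serve, tolerate bursts up to some capacity, and push back on the rest instead of falling over.

```python
import time


class TokenBucket:
    """Admit work at a sustained rate while tolerating bursts up to `capacity`.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the bucket can be tested
        self.tokens = capacity        # start full
        self.last = clock()

    def try_admit(self, cost=1.0):
        """Return True if the request may proceed, False if it should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        # Caller decides what "no" means: queue it, delay it, or reject it.
        return False
```

What you do on a `False` is the actual policy decision: queueing spreads the spike over time, rejecting pushes the retry back to the customer.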
If it's a matter of adding resources to respond to customer activity spikes and you just have to do it manually, it sounds like you have an automation problem.
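For the automation angle, the decision you're currently making by hand can often be written down as a rule. Here's a rough sketch of a proportional scaling rule, the same general shape as the one the Kubernetes Horizontal Pod Autoscaler documents. The function name, the utilization inputs, and the min/max bounds are assumptions for illustration, not anything from your setup.

```python
import math


def desired_replicas(current_replicas, observed_util, target_util,
                     min_replicas=1, max_replicas=100):
    """Scale replica count in proportion to observed vs. target utilization.

    Hypothetical helper: utilization can be in any unit as long as both
    values use the same one (e.g. integer percent CPU).
    """
    if target_util <= 0:
        raise ValueError("target_util must be positive")
    # Proportional rule: more load per replica than targeted means more replicas.
    raw = math.ceil(current_replicas * observed_util / target_util)
    # Clamp so a bad metric can't scale you to zero or to infinity.
    return max(min_replicas, min(max_replicas, raw))
```

For example, `desired_replicas(4, 90, 60)` asks for 6 replicas: four instances running at 90% against a 60% target need half again as much capacity. If a human is doing that arithmetic at 3am, that's the part to automate first.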
If your pager load has grown to the point where you can't do the project work needed to address whichever of the above apply to you, it's time to hand the pager back to devs. If you don't have the institutional authority to hand the pager back to devs, it sounds like you have a management problem.