Braze, a lifecycle engagement platform, collects data from customers’ applications across web, email, iOS devices, Android devices, and smart TVs. Braze ingests that information and more to build unique user profiles that help brands message their customers in more relevant, engaging, and timely ways across different platforms, using historical, in-the-moment, and predictive data.
Salvatore Poliandro is currently the Director of DevOps and Security at Braze and has been at Braze for the last three years. Sal leads multiple teams, including Site Reliability Engineers, IT and IT Operations. His group is focused on IT infrastructure through which they process 5 million jobs per minute. They make sure that infrastructure and applications can scale as required with the current architecture. Sal’s team owns monitoring, Continuous Integration and Continuous Delivery (CI/CD), alerting, cost and everything above the application layer.
Braze began with a dedicated infrastructure environment. At the time, the company had 30 employees, ran fewer than 12 servers, and handled fewer than 100 million API calls a day. Sal’s first challenge was scaling the infrastructure at a rapid pace while monitoring and governing the environment with complete transparency. The second was breaking the monolith application into microservices to use resources more efficiently and drive more accountability.
Braze decided to move to Amazon Web Services (AWS) in order to gain agility and scale. Now, the company has more than 240 employees and processes more than 300 million API calls every hour. Braze began with a primary monolith application that was developed seven years ago in Ruby. Currently, as a monolith app, it runs on thousands of Amazon EC2 instances across multiple regions, backed by hundreds of Redis nodes and thousands of MongoDB nodes.
Finding a solution
In order to ensure complete visibility while transforming and scaling rapidly, Sal and his team decided to adopt a new approach. Braze didn’t just want to choose vendors; the company was looking to forge lasting business partnerships. After several rounds of evaluation, Braze selected CloudHealth Technologies and Datadog as the key partners in the company’s transformation journey, based on the strength of their technology. Sal highlighted that the team initially weighed the internal “build vs. buy” question while evaluating CloudHealth Technologies and Datadog. However, considering the high Total Cost of Ownership (TCO) and the additional time required for ongoing maintenance, the team preferred to buy proven solutions.
Braze selected CloudHealth Technologies based on the platform’s proven leadership in Reserved Instance (RI) management, along with resource utilization and cost optimization. Sal was impressed that CloudHealth proactively identified and informed him about opportunities to save costs in AWS. He also found it advantageous that CloudHealth’s contract is based on Braze’s AWS spend.
“It was an easy decision [to select CloudHealth] when we moved to AWS. We were in a pre-series B phase of funding and burn rate was very important. I needed to optimize my cloud spend while saving time, even though those two things do not go in the same bucket together. That drove us to a decision to choose an external vendor and we chose CloudHealth over others for their proven solution,” adds Sal.
Braze chose Datadog over their incumbent and other vendors for Application Performance Monitoring (APM) support along with compliance and infrastructure monitoring. Sal remembers that “Integration with infrastructure and other cloud services was important for us. The incumbent could not do that. From the technical and financial perspective, Datadog made more sense to us. We wanted to make it easy to pull our data into a single platform. We tried to build that in-house—it was a huge undertaking and maintenance cost was just too high.”
Metrics like error rates, the amount of data ingested, and any delays with ingesting data or processing data for customers are critical for Braze. “Datadog is already doing this really well for Braze. It is our preferred platform for different services, versus just an application service that our incumbent could support,” mentions Sal.
Braze now has the visibility, agility, and transparency it needs for application development. Leveraging CloudHealth and Datadog together, Braze has gained the complete visibility needed to make well-informed decisions. Sal mentions that “We are a very transparent company and a lot of infrastructure was a black box for us. We could not show the granular spend earlier.”
“Being able to provide a self-service mechanism where our engineers can poke around their spend, cost savings, and key metrics (even from a security standpoint) has really been a big benefit across various teams. We get fewer questions from finance on them as a result.”
Director of DevOps and Security, Braze
CloudHealth has saved Braze millions of dollars through RI recommendations. Additionally, through cost allocation to different departments based on CloudHealth reports and dashboards, Braze has added more accountability. Sal proudly mentions that “CloudHealth reporting allows me to keep the teams in check as we grow. When we are doing POCs or starting a new project I can easily check team spend every day and send an email to the users to make them aware. That is a huge benefit in terms of creating business awareness on what resources cost.”
Since adopting Datadog, Braze has already realized benefits such as a reduction in TCO. Sal mentions that he has gotten back more than a day and a half per week for one of his most senior employees, which is enormous because that employee can now focus on more strategic, cost-saving, scalability, and performance initiatives. He adds that “We have gotten the most from APM to the point where our projected APM spend has decreased substantially over time. Datadog is running on 80% of our infrastructure.” With Datadog-led enablement sessions, each team at Braze has a clear understanding of how to leverage the platform for its own use case.
In addition to the value each platform has brought to Braze independently, together CloudHealth and Datadog help answer Braze executives’ chief concern: what is the company’s Total Cost of Goods Sold? Braze correlates data from CloudHealth and Datadog to determine their overall cloud usage and spend, and to break down that usage by individual customers to better understand where resources are allocated. From there, executives can determine the profit margin produced by individual customers and campaigns.
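As a rough illustration of the arithmetic behind that breakdown, the sketch below splits a total cloud bill across customers in proportion to their usage and then computes a per-customer margin. It is a minimal sketch, not Braze’s actual method: the function names, the proportional allocation rule, and the sample figures are all assumptions made for illustration, and it does not call any CloudHealth or Datadog API.

```python
from typing import Dict


def allocate_cost_by_usage(total_spend: float,
                           usage_by_customer: Dict[str, float]) -> Dict[str, float]:
    """Split total spend across customers in proportion to their usage.

    Hypothetical helper: 'usage' could be any consistent unit, such as
    messages sent or data points processed.
    """
    total_usage = sum(usage_by_customer.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        customer: total_spend * usage / total_usage
        for customer, usage in usage_by_customer.items()
    }


def profit_margin(revenue: float, cost: float) -> float:
    """Return margin as a fraction of revenue: (revenue - cost) / revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cost) / revenue


if __name__ == "__main__":
    # Illustrative numbers only, not real Braze data.
    spend = 1000.0
    usage = {"customer_a": 3.0, "customer_b": 1.0}
    revenue = {"customer_a": 1000.0, "customer_b": 500.0}

    costs = allocate_cost_by_usage(spend, usage)
    for customer, cost in costs.items():
        margin = profit_margin(revenue[customer], cost)
        print(f"{customer}: cost={cost:.2f} margin={margin:.0%}")
```

In practice the spend figure would come from a cost report and the usage figures from monitoring data; proportional allocation is only one of several reasonable rules for dividing shared infrastructure cost.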
What would be Salvatore's advice for cloud beginners?
Having been through his own hybrid cloud journey, Sal begins by saying others should “monitor everything in your infrastructure.” He also suggests displaying your ‘north star’ metrics somewhere visible. Doing so helps instill a culture of seeing what is going on and making better decisions. Examples of ‘north star’ metrics for Sal’s team are average processing time, percentage of uptime, and processing time of data points. Whereas Braze’s product managers want to look at the depth of overall product usage, Braze’s operations team is most concerned with tracking the total number of campaigns and the speed at which they are sent through the platform on behalf of their customers.