MOTIVATION
From the beginning, when the project was just a “startup” idea, Heroku was our go-to PaaS for deploying applications without getting bogged down by the complexities of infrastructure management. It's an excellent option for early-stage projects when your focus is on getting an app out there fast. However, as our project evolved over the years and started gaining users in the millions, we encountered several limitations that prompted us to seek a new platform for our backend services.
GOAL
Our primary goal was to improve our application's security and achieve greater flexibility in managing our infrastructure. A significant concern was the necessity for our main Postgres database to be publicly accessible from the internet so it could connect with Heroku instances. Additionally, we needed autoscaling for our instances, given the fluctuating traffic patterns to our backend throughout the day.
Other considerations also played a part in our decision to migrate. The costs of certain Heroku services, particularly managed databases, were becoming prohibitive. Recent security incidents, such as the GitHub integration incident, further underscored our concerns. Given that we were already utilizing AWS for portions of our infrastructure, it made sense to consolidate our operations and migrate everything from Heroku to AWS.
REQUIREMENTS
Before we kicked off the migration of our backend services and data to AWS, we laid out our goals and requirements. We aimed to retain all the functionalities provided by Heroku that we were already leveraging while also tapping into the new capabilities offered by AWS.
Here's a rundown of the requirements we set:
- Seamless User Experience: Be invisible to the app users, ensuring no disruption in their experience.
- Enhanced Security: Operate within an isolated subnet inside our Virtual Private Cloud (VPC) to enhance security.
- Robust Monitoring and Alerting: Implement comprehensive monitoring and alerting systems for proactive issue resolution and performance optimization.
- Advanced Log Management: Establish a system for the processing and analysis of application logs, aiding in troubleshooting and understanding user behavior.
- Streamlined Rollbacks: Ensure the ability to easily revert to previous versions of our app, enhancing reliability.
- Simple Service Management: Facilitate easy restarts of all backend services, improving manageability and uptime.
- Improved Scalability: Be easily scalable and flexible, since we expected both traffic and data to grow.
TARGET ARCHITECTURE
Our platform comprises two distinct applications: Zoe and Surge. Each application operates across multiple environments, with each pairing of app and environment safeguarded within its own VPC, ensuring complete isolation.
Backend Services
The backbone of our backend architecture is straightforward, consisting of:
- A main API service that handles HTTP requests from our mobile applications
- A worker service dedicated to executing background tasks
- A scheduler service designed to manage recurring events
For deployment, we have chosen AWS Fargate because of its scalability (it supports auto-scaling out-of-the-box) and ease of management.
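To give a sense of what this looks like in practice, here is a minimal CDK sketch of the API and worker services with CPU-based auto-scaling (the scheduler service follows the same pattern). The VPC layout, image names and scaling thresholds are illustrative placeholders, not our production values.

```ts
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

export class BackendStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // One VPC per app/environment pair; data stores live in the isolated subnets.
    const vpc = new ec2.Vpc(this, 'AppVpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    // Main API service behind an Application Load Balancer.
    const api = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'ApiService', {
      cluster,
      cpu: 512,
      memoryLimitMiB: 1024,
      desiredCount: 2,
      taskImageOptions: {
        image: ecs.ContainerImage.fromRegistry('my-registry/api:latest'), // placeholder image
        containerPort: 3000,
      },
    });

    // Out-of-the-box auto-scaling: track CPU utilization and scale between 2 and 10 tasks.
    const scaling = api.service.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 10 });
    scaling.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 60,
      scaleInCooldown: Duration.minutes(5),
      scaleOutCooldown: Duration.minutes(1),
    });

    // Worker service for background jobs (no load balancer needed).
    const workerTaskDef = new ecs.FargateTaskDefinition(this, 'WorkerTaskDef', {
      cpu: 256,
      memoryLimitMiB: 512,
    });
    workerTaskDef.addContainer('worker', {
      image: ecs.ContainerImage.fromRegistry('my-registry/worker:latest'), // placeholder image
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'worker' }),
    });
    new ecs.FargateService(this, 'WorkerService', {
      cluster,
      taskDefinition: workerTaskDef,
      desiredCount: 1,
    });
  }
}
```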
Data Storage for User Content
Data storage is managed through an RDS Postgres database, supporting our primary data needs. For temporary data storage and caching, ElastiCache Redis serves our requirements efficiently.
User-generated content, such as photos, is securely stored in S3 and served back to users via CloudFront CDN, ensuring a fast and reliable delivery network. Additionally, AWS Rekognition Content Moderation safeguards our community by identifying and filtering inappropriate content. Photo processing tasks, including resizing and format conversion, are handled by serverless Lambda functions.
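As an illustration of the moderation step, the sketch below shows a Lambda handler that runs Rekognition content moderation on newly uploaded photos. The S3 event wiring, confidence threshold and handling of flagged photos are simplified assumptions, not our exact pipeline.

```ts
import { S3Event } from 'aws-lambda';
import { RekognitionClient, DetectModerationLabelsCommand } from '@aws-sdk/client-rekognition';

const rekognition = new RekognitionClient({});

// Triggered by S3 "ObjectCreated" notifications on the user-content bucket.
export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    const result = await rekognition.send(
      new DetectModerationLabelsCommand({
        Image: { S3Object: { Bucket: bucket, Name: key } },
        MinConfidence: 80, // illustrative threshold
      })
    );

    if ((result.ModerationLabels ?? []).length > 0) {
      // In a real pipeline this would flag or quarantine the photo;
      // here we only log the detected labels.
      console.warn(`Moderation labels for ${key}:`, result.ModerationLabels);
    }
  }
};
```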
Logs & Monitoring
All logs and metrics are stored and analyzed through AWS CloudWatch. This allows us to maintain optimal performance and quickly address any issues.
Configuration
AWS Parameter Store and Secrets Manager facilitate configuration management and secure secret handling, ensuring that sensitive information is protected and easily accessible for authorized use only.
EXECUTION
CDK Infrastructure
Deciding to define our infrastructure as code was a pivotal step for us, aiming for better visibility and control over what's deployed in our AWS account. When weighing our options between AWS CDK, Terraform and Pulumi, we leaned towards CDK. Our decision was influenced by our team's familiarity with TypeScript, which is our primary programming language, and CDK being a “native” option for AWS. This choice also enabled us to move away from the Serverless framework for deploying Lambda functions, a tool with which our experiences hadn't been positive.
However, it's important to note that CDK, while powerful, isn't without its flaws. In particular, we've observed that certain higher-level (L3) constructs don't always follow AWS best practices. A notable example is the ApplicationLoadBalancedFargateService construct, which has an issue with the read-only root filesystem flag, as reported in this GitHub issue.
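Until this is addressed upstream, one workaround is a CDK escape hatch that sets the flag directly on the generated CloudFormation task definition. The snippet below is a sketch of that approach, assuming `api` is the ApplicationLoadBalancedFargateService instance created elsewhere in the stack:

```ts
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

// `api` is the ApplicationLoadBalancedFargateService created elsewhere in the stack.
declare const api: ecsPatterns.ApplicationLoadBalancedFargateService;

// Escape hatch: reach down to the underlying CloudFormation task definition
// and force ReadonlyRootFilesystem on the first (default) container.
const cfnTaskDef = api.taskDefinition.node.defaultChild as ecs.CfnTaskDefinition;
cfnTaskDef.addPropertyOverride('ContainerDefinitions.0.ReadonlyRootFilesystem', true);
```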
Configuration
Once our Fargate application was ready for deployment, the next critical step involved migrating all configuration settings. Heroku's approach to configuration management relies heavily on environment variables. In transitioning to AWS, we chose to leverage a mix of AWS Parameter Store and AWS Secrets Manager to manage our configurations. This shift required modifications to our application code, but it came with significant benefits. We used this opportunity to refactor our configuration management system. This involved purging obsolete configuration properties, incorporating validations and enhancing type safety in our codebase.
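The sketch below shows the general shape of such a config loader: values are read from Parameter Store and Secrets Manager, validated at startup and exposed through a typed interface. The parameter names and config properties are invented for illustration.

```ts
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm';
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager';

const ssm = new SSMClient({});
const secrets = new SecretsManagerClient({});

// The shape the application expects; adding a property here forces us
// to provide (and validate) it, which is the type safety we were after.
interface AppConfig {
  apiPort: number;
  redisUrl: string;
  databaseUrl: string;
}

async function getParameter(name: string): Promise<string> {
  const result = await ssm.send(new GetParameterCommand({ Name: name }));
  const value = result.Parameter?.Value;
  if (!value) throw new Error(`Missing SSM parameter: ${name}`);
  return value;
}

async function getSecret(id: string): Promise<string> {
  const result = await secrets.send(new GetSecretValueCommand({ SecretId: id }));
  if (!result.SecretString) throw new Error(`Missing secret: ${id}`);
  return result.SecretString;
}

// Parameter and secret names below are placeholders, not our real ones.
export async function loadConfig(env: string): Promise<AppConfig> {
  const apiPort = Number(await getParameter(`/zoe/${env}/api-port`));
  if (Number.isNaN(apiPort)) throw new Error('api-port must be a number');

  return {
    apiPort,
    redisUrl: await getParameter(`/zoe/${env}/redis-url`),
    databaseUrl: await getSecret(`zoe/${env}/database-url`),
  };
}
```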
User Migration Strategy
Following the successful deployment and configuration of our application in AWS, we moved to transition our user base from Heroku to AWS. We recognized that abruptly migrating all users to the new platform could lead to unforeseen issues. Therefore, our strategy was to initiate the migration with a small user group, progressively expanding to include more users while monitoring the system's stability. This approach also ensured we had a fallback plan to revert to Heroku if necessary.
Phased Rollout
To facilitate this phased rollout, we employed AWS Route53's Geolocation Routing Policy. This feature enables traffic distribution between Heroku and AWS at the DNS level by setting geolocation records for specific countries, continents or US states. Our initial phase targeted users in the Czech Republic, directing them to AWS, while the remaining user base continued on Heroku. This careful, region-by-region migration allowed us to monitor performance and user feedback closely. Once assured of the system's reliability and with no significant issues reported, we gradually redirected more regions to AWS. Eventually, AWS became the default route for all traffic.
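In CDK terms, this boils down to two record sets that share a name but carry different geolocation settings, roughly as in the sketch below. The domain, ALB DNS name and Heroku DNS target are placeholders.

```ts
import * as route53 from 'aws-cdk-lib/aws-route53';
import { Construct } from 'constructs';

declare const scope: Construct;
declare const hostedZone: route53.IHostedZone; // the zone managed in Route53

// Czech users resolve api.example.com to the AWS load balancer...
new route53.CfnRecordSet(scope, 'ApiCzechRecord', {
  hostedZoneId: hostedZone.hostedZoneId,
  name: 'api.example.com',
  type: 'CNAME',
  ttl: '60',
  setIdentifier: 'aws-cz',
  geoLocation: { countryCode: 'CZ' },
  resourceRecords: ['my-alb-123456.eu-west-1.elb.amazonaws.com'], // placeholder ALB DNS name
});

// ...while everyone else keeps hitting Heroku via the default ("*") geolocation record.
new route53.CfnRecordSet(scope, 'ApiDefaultRecord', {
  hostedZoneId: hostedZone.hostedZoneId,
  name: 'api.example.com',
  type: 'CNAME',
  ttl: '60',
  setIdentifier: 'heroku-default',
  geoLocation: { countryCode: '*' },
  resourceRecords: ['my-app.herokudns.com'], // placeholder Heroku DNS target
});
```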
GoDaddy & Geolocation-Based Routing
It's worth noting that not all DNS service providers offer geolocation-based routing. Our domain is registered with GoDaddy, which lacks this feature. However, AWS Route53 offers a solution where DNS records can be managed by Route53 while keeping the domain registered with another provider. This enables us to leverage this advanced routing capability without changing our domain registrar.
Secure Database Migration
A primary objective in our migration plan was to enhance the security of our main database by relocating it from the default VPC with public internet accessibility to an isolated subnet within the new VPC designated for our application. The main motivation for this move was to mitigate security risks; having our main database publicly accessible, even with robust passwords and SSL encryption, represented a significant vulnerability.
Database Costs
A significant benefit we've observed is the reduced cost of our database instances. Our database server requires a minimum of 256 GB of memory. On Heroku, the most affordable option is the standard-7 instance, priced at $3,500 per month. In contrast, a similarly sized reserved instance on AWS (db.x2g.4xlarge) is available for $1,014. It's important to note that AWS might incur additional charges for storage, backups and monitoring. However, even with these extra costs, AWS remains substantially more economical than Heroku.
Database AZ Migration
Given that our destination subnet resides in a different Availability Zone (AZ), a direct transfer of the database instance wasn't feasible. Instead, our strategy involved:
1. Establishing a database subnet group in the target VPC to accommodate the new database environment.
2. Initiating the creation of a read replica of our database within the desired region, VPC and subnet.
3. Monitoring the replication process until all data was fully transferred and ensuring the replication lag reached zero.
4. Implementing a VPN connection for developer access to the new database, ensuring uninterrupted workflow.
5. Storing new database instance connection credentials securely in AWS Secrets Manager.
6. Temporarily halting the API servers to initiate the migration, marking the start of a planned downtime.
7. Promoting the read replica to serve as the new master database.
8. Restarting the servers to apply the new credentials and establish a connection with the new database, thereby concluding the downtime.
To minimize service interruption and data loss, it was crucial that steps 6 through 8 were executed swiftly and seamlessly.
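For illustration, here is roughly what steps 2 and 7 look like when driven through the AWS SDK instead of the console. The instance identifiers, subnet group name and polling interval are placeholders, and we assume the target subnet group can be supplied when the replica is created.

```ts
import {
  RDSClient,
  CreateDBInstanceReadReplicaCommand,
  PromoteReadReplicaCommand,
  DescribeDBInstancesCommand,
} from '@aws-sdk/client-rds';

const rds = new RDSClient({});

// Step 2: create a read replica inside the new VPC's database subnet group.
export async function createReplica(): Promise<void> {
  await rds.send(
    new CreateDBInstanceReadReplicaCommand({
      DBInstanceIdentifier: 'zoe-production-new', // placeholder identifiers
      SourceDBInstanceIdentifier: 'zoe-production',
      DBSubnetGroupName: 'zoe-production-isolated',
      DBInstanceClass: 'db.x2g.4xlarge',
    })
  );
}

// Step 7: once replication lag is zero and the API servers are stopped,
// promote the replica to a standalone (writable) instance.
export async function promoteReplica(): Promise<void> {
  await rds.send(
    new PromoteReadReplicaCommand({ DBInstanceIdentifier: 'zoe-production-new' })
  );

  // Poll until the instance reports "available" again before restarting the servers.
  let status = '';
  do {
    const { DBInstances } = await rds.send(
      new DescribeDBInstancesCommand({ DBInstanceIdentifier: 'zoe-production-new' })
    );
    status = DBInstances?.[0]?.DBInstanceStatus ?? '';
    if (status !== 'available') {
      await new Promise((resolve) => setTimeout(resolve, 30_000)); // check every 30 seconds
    }
  } while (status !== 'available');
}
```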
Access to Private VPC Resources
Transitioning our main database to a private subnet within the VPC significantly enhanced security but introduced new challenges for our development team. Developers require access to the database for routine tasks such as performing migrations and resolving issues. Traditionally, this access hurdle is overcome with solutions like VPNs or bastion hosts. However, we opted for a more streamlined approach by integrating Twingate, a service that simplifies secure access to private resources.
Implementing Twingate was straightforward:
- We deployed an additional Docker container into our Fargate cluster for Twingate.
- Team members installed the Twingate client app on their computers.
- Logging in through a Google account immediately grants access to resources within the VPC.
Twingate not only facilitated easy permission settings and real-time monitoring but also alleviated common VPN-related hassles, such as certificate management and IP range conflicts. You can also assign aliases to your resources, so your database can have a nice, easy-to-remember hostname like database.production instead of a hostname AWS assigns. Its user-friendly setup is accessible even to non-technical team members who might need to execute analytical queries on the database. Additionally, Twingate offers a free tier for small teams (up to five members), making it an economically viable option for startups and small projects.
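For reference, deploying the connector boils down to one extra Fargate service. The sketch below follows Twingate's Docker deployment model as we understand it; the image tag, network slug and token wiring are assumptions and should be checked against Twingate's current documentation.

```ts
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import { Construct } from 'constructs';

declare const scope: Construct;
declare const cluster: ecs.Cluster;                     // the existing Fargate cluster
declare const connectorTokens: secretsmanager.ISecret;  // tokens issued by Twingate, kept in Secrets Manager

// The connector is just another Fargate task running inside the VPC,
// so it can reach the private database and relay traffic to Twingate clients.
const taskDef = new ecs.FargateTaskDefinition(scope, 'TwingateTaskDef', {
  cpu: 256,
  memoryLimitMiB: 512,
});

taskDef.addContainer('twingate-connector', {
  // Image and variable names follow Twingate's Docker docs at the time of writing; verify before use.
  image: ecs.ContainerImage.fromRegistry('twingate/connector:1'),
  environment: {
    TWINGATE_NETWORK: 'our-network', // placeholder Twingate network slug
  },
  secrets: {
    TWINGATE_ACCESS_TOKEN: ecs.Secret.fromSecretsManager(connectorTokens, 'accessToken'),
    TWINGATE_REFRESH_TOKEN: ecs.Secret.fromSecretsManager(connectorTokens, 'refreshToken'),
  },
  logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'twingate' }),
});

new ecs.FargateService(scope, 'TwingateConnector', {
  cluster,
  taskDefinition: taskDef,
  desiredCount: 1,
});
```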
Monitoring
What we like most about the new setup is the monitoring system. The consolidation into a single, fully customizable dashboard marks a significant improvement over our previous setup. Previously, our metrics were fragmented across Heroku and AWS, making it challenging to get a cohesive view of our system's health. Now, everything is centralized, allowing immediate visibility into anomalies or issues.
Another notable advantage of our new setup is the enhanced alerting capabilities. AWS allows us to configure alarms that trigger SMS notifications, a feature conspicuously absent in our Heroku setup. This capability ensures that the right people are alerted instantly, regardless of where they are, enabling rapid response to any emerging issues.
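Here is a minimal CDK sketch of such an alarm, assuming a hypothetical on-call phone number and using the API service's CPU utilization as the example metric:

```ts
import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cloudwatchActions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as snsSubscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import { Construct } from 'constructs';

declare const scope: Construct;
declare const apiService: ecs.FargateService; // the main API Fargate service

// On-call topic with an SMS subscription, the alerting channel we missed on Heroku.
const onCallTopic = new sns.Topic(scope, 'OnCallTopic');
onCallTopic.addSubscription(new snsSubscriptions.SmsSubscription('+420123456789')); // placeholder number

// Alarm when the API service's CPU stays above 85% for three consecutive 5-minute periods.
const cpuAlarm = new cloudwatch.Alarm(scope, 'ApiHighCpuAlarm', {
  metric: apiService.metricCpuUtilization({ period: Duration.minutes(5) }),
  threshold: 85,
  evaluationPeriods: 3,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
});

cpuAlarm.addAlarmAction(new cloudwatchActions.SnsAction(onCallTopic));
```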
App Rollbacks and Restarts
In our operational workflow, the ability to quickly restart the application or roll back to a previous version is crucial. This process was notably straightforward on Heroku, essentially a one-click action within their user interface. However, transitioning to AWS introduced complexity to these tasks. To navigate this, we implemented a solution leveraging AWS Lambda functions, enabling us to perform app restarts and rollbacks either via the AWS Command Line Interface (CLI) or directly from the AWS Management Console.
The integration of AWS CDK into our infrastructure management strategy further enhanced this approach. CDK allows us to dynamically generate a list of services that require action (restart or rollback) and then pass this list to the Lambda function for execution. This not only simplifies what would otherwise be a cumbersome process in AWS but also highlights the versatility and power of using CDK for infrastructure as code (IaC). It marries the operational agility we enjoyed on Heroku with the robustness and scalability of AWS.
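The restart path is the simpler of the two, so here is a sketch of what such a Lambda can look like: it receives a cluster name and the list of services generated by CDK, then forces a new deployment of each one. Rollbacks work similarly, except the update also pins a previous task definition revision. The payload shape and identifiers are illustrative.

```ts
import { ECSClient, UpdateServiceCommand } from '@aws-sdk/client-ecs';

const ecsClient = new ECSClient({});

// Invoked from the CLI or the console with a payload like:
// { "cluster": "zoe-production", "services": ["api", "worker", "scheduler"] }
interface RestartEvent {
  cluster: string;
  services: string[];
}

export const handler = async (event: RestartEvent): Promise<void> => {
  for (const service of event.services) {
    // forceNewDeployment restarts all tasks using the currently registered
    // task definition, effectively the "restart" button we had on Heroku.
    await ecsClient.send(
      new UpdateServiceCommand({
        cluster: event.cluster,
        service,
        forceNewDeployment: true,
      })
    );
  }
};
```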
CONCLUSION
Our migration project was a success. We fulfilled all the requirements we set at the beginning of the process, and the experience was rich with learning opportunities, particularly in AWS services and DevOps practices at large. Although the migration spanned roughly a year — longer than initially anticipated — this timeframe was influenced by the need to balance other priorities alongside the migration efforts.
A significant achievement of this project was the consolidation of our resources under a single provider, which has fortified our application’s security, enhanced our monitoring capabilities and improved scalability and flexibility of our infrastructure. Notably, the migration process was executed with minimal disruption, ensuring that our users experienced no major outages. This seamless transition speaks volumes about the meticulous planning and execution of our migration strategy.
If you would like to learn how we migrated all our user files from Azure to AWS S3, check out this blog post by Jozef Cipa.