Job Title: AWS Architect/SRE (Resilience, Testability & Scalability)
Location: NYC, NY/Fort Mill, SC (Onsite)
Position Type: Long-term contract
Note: Looking for a seasoned AWS Architect with 15+ years of experience who has worked in the Financial/Wealth Management sector and has hands-on skills in AWS Glue and Chaos Engineering/Testing.
We are looking for a hands-on, technically strong Resilience, Testability & Scalability Lead to drive engineering excellence across our data platforms and cloud-based applications. This role is critical in ensuring system uptime, test automation maturity, performance under scale, and architectural resilience to meet stringent regulatory and service-level demands.
The ideal candidate will have a deep background in designing highly available systems, implementing robust disaster recovery, managing scalable cloud infrastructure, and building automated, testable, and observable platforms—especially within AWS and Kubernetes environments.
Key Responsibilities:
- Design and implement high availability and failover strategies across multi-zone AWS deployments
- Lead the development and execution of disaster recovery and business continuity plans, including RTO/RPO validation and cross-region strategies
- Define testability strategies, test data management frameworks, and performance testing protocols
- Enable infrastructure and application resilience by introducing circuit breakers, retry patterns, service meshes, and graceful degradation mechanisms
- Establish real-time monitoring, alerting, and log aggregation frameworks using tools like CloudWatch and Prometheus
- Drive test automation and quality engineering best practices, integrating with CI/CD pipelines
- Optimize application and data layer performance through query tuning, caching, and indexing strategies
- Scale data processing using distributed frameworks like Apache Spark, and implement event-driven stream processing with Kafka
- Collaborate with platform, DevOps, and SRE teams to ensure resource efficiency, cost control, and performance SLAs
- Contribute to regulatory readiness by enforcing security, encryption, and audit logging standards
Required Skills & Experience:
Infrastructure Resilience & DR:
• Multi-AZ deployments, auto-scaling, load balancing, circuit breakers
• Disaster recovery design: backup/restore, cross-region replication, RTO/RPO
Monitoring & Observability:
• Experience with CloudWatch, Prometheus, log aggregators
• Alerting setup for incident response, covering latency, throughput, and error rates
Application Resilience & Security:
• Error handling, service degradation, exponential backoff
• Security best practices: IAM policies, encryption at rest and in transit
• Familiarity with FINRA/SIPC compliance standards (preferred)
Test Automation & Quality:
• Unit testing (e.g., PyTest), integration testing, E2E automation
• Test data generation, synthetic data, environment provisioning
• Performance testing using JMeter or Gatling, including stress and capacity testing
• Code reviews, static analysis, data validation, anomaly detection
Scalability & Optimization:
• Horizontal scaling using Kubernetes, Docker, service discovery
• API Gateway, caching layers (Redis, Memcached), DB partitioning
• Connection pooling, capacity planning, cost-aware architecture
Data & Stream Processing:
• Spark cluster management, parallel processing, big data optimization
• Kafka-based messaging, windowing, and aggregation for real-time data
Preferred Qualifications:
• Experience in financial services or regulated environments
• Familiarity with enterprise data and platform modernization initiatives
• AWS or Kubernetes certifications
• Strong communication skills and cross-functional collaboration experience
Please fill in the skill matrix below for the client submission:
Skills | No. of Years of Experience | Detailed Write-up |
Highest Education | | |
Certifications (AWS Solutions Architect Pro, Kubernetes Admin, Chaos Engineering Trainer) | | |
Overall AWS Experience - Total years, domains (Finance/Wealth Mgmt) | | |
AWS Glue (ETL workflows, data catalog management, job orchestration) | | |
Chaos Engineering/Testing (Tools (e.g. Chaos Monkey), failure injection, resilience validation) | | |
High Availability / DR (Multi-AZ, cross-region failover, RTO/RPO definitions & validation) | | |
Monitoring & Observability (CloudWatch, Prometheus, log aggregation pipelines (ELK/CloudWatch Logs), alerting setups) | | |
Testability & Automation (CI/CD integration, synthetic data, unit/integration/E2E testing) | | |
Application Resilience (Circuit breakers, retries, backoff patterns, graceful degradation strategies) | | |
Security & Compliance (IAM policies, data encryption (at rest/in transit), FINRA/SIPC alignment) | | |
Data & Stream Processing (Apache Spark, Kafka, structured streaming, performance optimization) | | |
Compute & Orchestration (Kubernetes, Docker, auto-scaling groups, API Gateway) | | |
Performance Optimization (Query tuning, caching (Redis/Memcached), DB partitioning) | | |
Disaster Recovery / DR Plans (Backup, restore, cross-region replication, DR drills) | | |
Cost Management (AWS Cost Explorer, capacity planning, scalable architecture strategies) | | |
Infrastructure as Code (Terraform, CloudFormation, config management) | | |
Tools & Languages (Python, Java, Node.js, JMeter, Gatling, Bash scripting for CI/CD) | | |
Communication / Leadership (Stakeholder engagement, cross-functional collaboration, documentation maturity) | | |
Thanks
Gigagiglet
gigagiglet.blogspot.com