Giga Giglet: AWS Architect/SRE (Resilience, Testability & Scalability)

Job Title: AWS Architect/SRE (Resilience, Testability & Scalability)

Location: NYC, NY/Fort Mill, SC (Onsite)

Position Type: Long-term contract

Note: We are looking for a seasoned AWS Architect with 15+ years of experience who has worked in the financial or wealth management sector and has skills in AWS Glue and chaos engineering/testing.

We are looking for a hands-on, technically strong Resilience, Testability & Scalability Lead to drive engineering excellence across our data platforms and cloud-based applications. This role is critical in ensuring system uptime, test automation maturity, performance under scale, and architectural resilience to meet stringent regulatory and service-level demands.

The ideal candidate will have a deep background in designing highly available systems, implementing robust disaster recovery, managing scalable cloud infrastructure, and building automated, testable, and observable platforms—especially within AWS and Kubernetes environments.

Key Responsibilities:

• Design and implement high availability and failover strategies across multi-zone AWS deployments
• Lead the development and execution of disaster recovery and business continuity plans, including RTO/RPO validation and cross-region strategies
• Define testability strategies, test data management frameworks, and performance testing protocols
• Enable infrastructure and application resilience by introducing circuit breakers, retry patterns, service meshes, and graceful degradation mechanisms
• Establish real-time monitoring, alerting, and log aggregation frameworks using tools like CloudWatch and Prometheus
• Drive test automation and quality engineering best practices, integrating with CI/CD pipelines
• Optimize application and data layer performance through query tuning, caching, and indexing strategies
• Scale data processing using distributed frameworks like Apache Spark, and implement event-driven stream processing with Kafka
• Collaborate with platform, DevOps, and SRE teams to ensure resource efficiency, cost control, and performance SLAs
• Contribute to regulatory readiness by enforcing security, encryption, and audit logging standards

Required Skills & Experience:

Infrastructure Resilience & DR:

• Multi-AZ deployments, auto-scaling, load balancing, circuit breakers

• Disaster recovery design: backup/restore, cross-region replication, RTO/RPO

Monitoring & Observability:

• Experience with CloudWatch, Prometheus, log aggregators

• Alerting setup for incident response, covering latency, throughput, and error rates

Application Resilience & Security:

• Error handling, service degradation, exponential backoff

• Security best practices: IAM policies, encryption at rest and in transit

• Familiarity with FINRA/SIPC compliance standards (preferred)

Test Automation & Quality:

• Unit testing (e.g., PyTest), integration testing, E2E automation

• Test data generation, synthetic data, environment provisioning

• Performance testing using JMeter or Gatling; stress and capacity testing

• Code reviews, static analysis, data validation, anomaly detection

Scalability & Optimization:

• Horizontal scaling using Kubernetes, Docker, service discovery

• API Gateway, caching layers (Redis, Memcached), DB partitioning

• Connection pooling, capacity planning, cost-aware architecture

Data & Stream Processing:

• Spark cluster management, parallel processing, big data optimization

• Kafka-based messaging, windowing, and aggregation for real-time data

Preferred Qualifications:

• Experience in financial services or regulated environments

• Familiarity with enterprise data and platform modernization initiatives

• AWS or Kubernetes certifications

• Strong communication skills and cross-functional collaboration experience

Please fill in the skill matrix below for the client submission:

Skills	No. of Years of Experience	Detailed write up
Highest Education
Certifications (AWS Solutions Architect Pro, Kubernetes Admin, Chaos Engineering Trainer)
Overall AWS Experience - Total years, domains (Finance/Wealth Mgmt)
AWS Glue (ETL workflows, data catalog management, job orchestration)
Chaos Engineering/Testing (Tools (e.g. Chaos Monkey), failure injection, resilience validation)
High Availability / DR (Multi-AZ, cross-region failover, RTO/RPO definitions & validation)
Monitoring & Observability (CloudWatch, Prometheus, log aggregation pipelines (ELK/CloudWatch Logs), alerting setups)
Testability & Automation (CI/CD integration, synthetic data, unit/integration/E2E testing)
Application Resilience (Circuit breakers, retries, backoff patterns, graceful degradation strategies)
Security & Compliance (IAM policies, data encryption (at rest/in transit), FINRA/SIPC alignment)
Data & Stream Processing (Apache Spark, Kafka, structured streaming, performance optimization)
Compute & Orchestration (Kubernetes, Docker, auto-scaling groups, API Gateway)
Performance Optimization (Query tuning, caching (Redis/Memcached), DB partitioning)
Disaster Recovery / DR Plans (Backup, restore, cross-region replication, DR drills)
Cost Management (AWS Cost Explorer, capacity planning, scalable architecture strategies)
Infrastructure as Code (Terraform, CloudFormation, config management)
Tools & Languages (Python, Java, Node.js, JMeter, Gatling, Bash scripting for CI/CD)
Communication / Leadership (Stakeholder engagement, cross-functional collaboration, documentation maturity)


American IT Systems, 1116 S Walton Blvd, Suite 113, Bentonville, Arkansas 72712 Phone: 315-626-0307
