EOP - System Reliability Engineer - TS/SCI Required
Remote
Full Time
EOP - IT & Cyber
Experienced
cFocus Software seeks a System Reliability Engineer to join our program supporting the Executive Office of the President. This position is remote. This position requires a TS/SCI clearance.
Qualifications:
- 5+ years of experience and a Bachelor's degree in Computer Programming, Science, Engineering, or a related technical discipline, or an equivalent combination of education, technical training, or work/military experience, including:
- 3+ years of related systems programming experience
- Experience maintaining an operational environment and using monitoring tools and dashboard interfaces (e.g., Kibana, Grafana)
- Experience working with container images and platforms (Kubernetes/Docker)
- Strong understanding of DevOps and software/application development processes
- Understanding of GitLab, Jenkins, ArgoCD, and other DevOps/Continuous Integration tools for Kubernetes
- Understanding of microservice design and architectural pattern best practices
- Understanding of Python, Bash, and Shell scripting
- Knowledge of network technologies, common infrastructure components, load balancers, firewalls, virtual and physical infrastructure design
- Problem-solving and troubleshooting skills
- Communication and interpersonal skills
- Must possess excellent time management skills and the drive to work unsupervised
- Experience deploying to on-prem/data center infrastructure
- Experience using Jira and Confluence on a daily basis
- Experience building processes for deploying to a Kubernetes-based environment using GitLab and Helm
- Understanding of access management and security groups (e.g., IAM, S3 buckets, SSH, VPN)
- Ability to write and run unit and functional tests
- Technical Skills: Proficiency in programming languages (such as Python, Go, or Bash) is essential for scripting and automation tasks. Knowledge of Linux/Unix systems is also crucial, as SREs often work in these environments.
- Problem-Solving: Analytical and problem-solving skills are necessary to diagnose and resolve complex system issues effectively.
- Understanding of SRE Principles: Familiarity with key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets is important for measuring and maintaining system reliability.
- Reliability and Availability: SRE practices help ensure that services are consistently available and reliable, which is critical for user satisfaction and business success.
- Scalability: SREs implement strategies that allow systems to scale efficiently as demand increases, ensuring that performance remains optimal even under heavy load.
- Cost Management: By optimizing resource usage and reducing downtime, SREs contribute to cost savings for organizations.
- Programming and Scripting: Proficiency in languages like Python, Go, or Ruby is crucial for automating tasks and managing infrastructure.
- Operating Systems: A strong understanding of Linux/Unix systems is essential for troubleshooting and managing servers.
- Cloud Computing: Familiarity with cloud platforms like AWS, Azure, or Google Cloud is vital for deploying and managing applications in distributed environments.
- Containers & Orchestration: Understanding containerization tools like Docker and managing containerized workloads with Kubernetes is crucial for cloud-native applications.
- Monitoring and Logging: Proficiency in tools like Prometheus, Grafana, or the ELK Stack (Elasticsearch, Logstash, and Kibana) is necessary for tracking metrics, setting up alerts, and analyzing logs.
- Networking: Knowledge of networking protocols and configurations is essential for maintaining system health and performance.
- Configuration Management: Skills in managing and maintaining system configurations are critical for ensuring system reliability.
- Incident Response: Ability to respond quickly and effectively to incidents, including documenting and learning from them.
- Security Best Practices: Understanding security protocols and best practices to protect systems from vulnerabilities.
- These skills are essential for SREs to maintain high availability and performance, balancing the demands of development and operations.
Responsibilities:
- Support required during core business hours of 8am – 5pm, Monday through Friday.
- On-call for evenings or weekends, if needed for outages, application upgrades, security patches or other unplanned activities.
- Monitor system health, availability, and performance using centralized monitoring and logging tools.
- Administration of accounts (role-based access and rights).
- Manage accessibility to the application through EOP’s authentication systems.
- Manage the workflow templates to ensure consistent and predictable task flows.
- Configure new workflows or adjust existing ones based on user requests, while adhering to EOP template standards.
- Maintain configurations and configurable fields for users and workflows.
- Maintain the test environment to mimic production and conduct test and evaluation in the environment prior to deployments.
- Design and maintain a secure and reliable form of backups, ensuring High Availability (HA) and resiliency.
- Develop a Disaster Recovery (DR) or Incident Response (IR) plan for specific applications and services in the event of a disaster or unexpected downtime.
- Maintain unique instances that support various offices.
- Configure and support integrations with complementary systems.
- Establish and improve system monitoring while maintaining established security protocols within development, test, and production systems.
- Architect, build, and maintain on-premises and/or cloud infrastructure to support team and customer initiatives.
- Maintain and improve existing infrastructure (build out autoscaling, support new services, optimize for cost efficiencies/authentication/search, etc.).
- Administer production, staging and development environments.
- Manage and aggregate server logs and monitor for security- and system-related incidents.
- Monitor and analyze system performance, such as server load and resource usage.
- Maintain and improve existing build and deployment processes using CI/CD tools.
- Apply configuration management disciplines to maintain software revisions, security patches, hardening, and documentation.
- Enforce best practices for security and reliability, and drive security initiatives, like access control and vulnerability testing.
- Maintain up-to-date documentation of designs and configurations so that team members have continuity on recurring tasks.
- Maintain status of operations at all times: perform after-action reporting on all outages and work with engineering teams to determine the root cause and solution. Present findings to management for prioritization and tasking.
- Create and determine required metrics for dashboards and service health.
- Follow up on engineering tasks for operational solutions and validate completion.
- Manage operational readiness board – present at weekly meetings and determine if development services are ready for automation based on best practices and maintainability.
- Track and ensure routine operations maintenance tasks are completed in a timely manner.
- Align to the customer's strategies for configuration of workflows, without compromising the integrity of the workflow tool and templates.
- Build, maintain, and utilize the customer's enterprise Development, Security, and Operations (DevSecOps) pipeline.
- Work with other service providers to support areas of common interest.
- On-call support may be required.