DevOps Implementation for Software Development Teams
The DevOps market is growing rapidly, projected to reach $25.5 billion by 2028 at a CAGR of 19.1%. As organizations recognize the importance of rapid, reliable software delivery, DevOps has evolved from a cultural movement into an essential business capability. This guide gives software development teams practical strategies, tools, and best practices for a DevOps implementation that accelerates delivery, improves quality, and strengthens collaboration across the entire software development lifecycle.
Understanding DevOps Fundamentals
Core DevOps Principles
Culture of Collaboration: DevOps breaks down traditional silos between development, operations, and quality assurance teams, fostering shared responsibility for the entire software lifecycle.
Automation First: Automation is the cornerstone of DevOps, eliminating manual processes, reducing errors, and enabling consistent, repeatable operations across development, testing, and deployment.
Continuous Improvement: DevOps embraces continuous learning through feedback loops, metrics-driven decision making, and iterative enhancement of processes and tools.
Infrastructure as Code: Infrastructure is managed through code, enabling version control, repeatability, and automated provisioning of environments.
The DevOps Lifecycle
graph LR
A[Plan] --> B[Code]
B --> C[Build]
C --> D[Test]
D --> E[Release]
E --> F[Deploy]
F --> G[Operate]
G --> H[Monitor]
H --> A
Each phase of the DevOps lifecycle includes specific practices, tools, and metrics that contribute to the overall goal of faster, more reliable software delivery.
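Progress through these phases is commonly measured with the four DORA metrics. As a minimal sketch (hypothetical event records, not tied to any specific tool), deployment frequency, lead time, and change-failure rate can be derived from a list of deployment events:

```python
from datetime import datetime, timedelta
from statistics import mean

def dora_metrics(deployments):
    """deployments: dicts with 'committed_at'/'deployed_at' datetimes and a
    'failed' flag. Returns deployment frequency (per day), mean lead time
    (hours), and change-failure rate."""
    if not deployments:
        return {"deploys_per_day": 0.0, "lead_time_hours": 0.0, "change_failure_rate": 0.0}
    times = sorted(d["deployed_at"] for d in deployments)
    # Observation window in days, floored at one day to avoid divide-by-zero
    span_days = max((times[-1] - times[0]).total_seconds() / 86400, 1.0)
    lead_hours = mean(
        (d["deployed_at"] - d["committed_at"]).total_seconds() / 3600 for d in deployments
    )
    failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
    return {
        "deploys_per_day": len(deployments) / span_days,
        "lead_time_hours": lead_hours,
        "change_failure_rate": failure_rate,
    }
```

Feeding this from your CI system's deployment log gives a baseline before and after each process change.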
Continuous Integration (CI) Implementation
CI Pipeline Architecture
Modern CI Pipeline Design
# GitLab CI Pipeline Example
stages:
- validate
- test
- security
- build
- package
- deploy-staging
- integration-tests
- deploy-production
variables:
DOCKER_REGISTRY: "registry.company.com"
APP_NAME: "microservice-app"
KUBERNETES_NAMESPACE: "production"
# Validation Stage
code-quality:
stage: validate
image: node:18-alpine
script:
- npm ci
- npm run lint
- npm run prettier:check
- npm run type-check
artifacts:
reports:
junit: reports/lint-results.xml
coverage: '/Coverage: \d+\.\d+%/'
security-scan:
stage: validate
image: securecodewarrior/sca-tools:latest
script:
- npm audit --audit-level moderate
- snyk test --severity-threshold=high
- truffleHog --regex --entropy=False .
artifacts:
reports:
sast: security-report.json
allow_failure: false
# Testing Stage
unit-tests:
stage: test
image: node:18-alpine
script:
- npm ci
- npm run test:unit -- --coverage --watchAll=false
artifacts:
reports:
junit: reports/junit.xml
coverage_report:
coverage_format: cobertura
path: coverage/cobertura-coverage.xml
paths:
- coverage/
coverage: '/Lines\s*:\s*(\d+\.\d+)%/'
integration-tests:
stage: test
image: node:18-alpine
services:
- postgres:13
- redis:6
variables:
DATABASE_URL: "postgresql://test:test@postgres:5432/testdb"
REDIS_URL: "redis://redis:6379"
script:
- npm ci
- npm run test:integration
artifacts:
reports:
junit: reports/integration-test-results.xml
e2e-tests:
stage: test
image: mcr.microsoft.com/playwright:latest
script:
- npm ci
- npm run build
- npm run test:e2e
artifacts:
when: always
paths:
- test-results/
- playwright-report/
retry:
max: 2
when: runner_system_failure
# Build Stage
build-application:
stage: build
image: docker:latest
services:
- docker:dind
before_script:
- docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $DOCKER_REGISTRY
script:
- |
# Multi-stage Docker build
docker build \
--build-arg BUILD_VERSION=$CI_COMMIT_SHA \
--build-arg BUILD_DATE=$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--tag $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA \
--tag $DOCKER_REGISTRY/$APP_NAME:latest \
.
- docker push $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA
- docker push $DOCKER_REGISTRY/$APP_NAME:latest
only:
- main
- develop
# Package Stage
helm-package:
stage: package
image: alpine/helm:latest
script:
- |
# Package Helm chart
helm package helm-chart/ \
--version $CI_COMMIT_SHA \
--app-version $CI_COMMIT_SHA
- |
# Upload to Helm repository
curl --data-binary "@${APP_NAME}-${CI_COMMIT_SHA}.tgz" \
$HELM_REPO_URL/api/charts
artifacts:
paths:
- "*.tgz"
only:
- main
# Deployment Stages
deploy-staging:
stage: deploy-staging
image: dtzar/helm-kubectl:latest  # job runs both helm and kubectl; bitnami/kubectl lacks helm
environment:
name: staging
url: https://staging.company.com
script:
- |
# Deploy to staging using Helm
helm upgrade --install $APP_NAME-staging \
helm-chart/ \
--namespace staging \
--set image.tag=$CI_COMMIT_SHA \
--set ingress.host=staging.company.com \
--set resources.limits.memory=512Mi \
--set replicaCount=2
- |
# Wait for deployment to complete
kubectl rollout status deployment/$APP_NAME-staging \
--namespace staging \
--timeout=300s
only:
- main
- develop
production-deployment:
stage: deploy-production
image: dtzar/helm-kubectl:latest  # job runs both helm and kubectl; bitnami/kubectl lacks helm
environment:
name: production
url: https://company.com
script:
- |
# Blue-green deployment strategy
helm upgrade --install $APP_NAME-production \
helm-chart/ \
--namespace production \
--set image.tag=$CI_COMMIT_SHA \
--set ingress.host=company.com \
--set resources.limits.memory=1Gi \
--set replicaCount=5 \
--set strategy.type=BlueGreen
- |
# Health check and traffic switch
kubectl rollout status deployment/$APP_NAME-production \
--namespace production \
--timeout=600s
when: manual
only:
- main
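A pipeline this large fails fast when a job references a stage that was renamed but not updated in `stages:`. A small standalone check (plain Python over an already-parsed pipeline dict; this is not a GitLab API, just a sketch of the validation idea) catches that before a push:

```python
def validate_pipeline(pipeline: dict) -> list:
    """Return error strings for jobs whose 'stage' is not declared in the
    top-level 'stages' list (GitLab defaults a job's stage to 'test')."""
    # Top-level keywords that are not jobs
    reserved = {"stages", "variables", "default", "workflow", "include"}
    stages = pipeline.get("stages", ["build", "test", "deploy"])
    errors = []
    for name, job in pipeline.items():
        if name in reserved or not isinstance(job, dict):
            continue
        stage = job.get("stage", "test")
        if stage not in stages:
            errors.append(f"job '{name}' uses undeclared stage '{stage}'")
    return errors
```

Run it in a pre-commit hook after `yaml.safe_load(".gitlab-ci.yml")`, or rely on GitLab's built-in CI Lint endpoint for a full server-side validation.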
Advanced CI Practices
Parallel Pipeline Execution
// Jenkins Pipeline with Parallel Stages
pipeline {
agent any
environment {
DOCKER_REGISTRY = 'your-registry.com'
APP_NAME = 'microservice-app'
SONAR_HOST = 'https://sonarqube.company.com'
}
stages {
stage('Checkout') {
steps {
checkout scm
script {
env.GIT_COMMIT_SHORT = sh(
script: 'git rev-parse --short HEAD',
returnStdout: true
).trim()
env.BUILD_VERSION = "${env.BUILD_NUMBER}-${env.GIT_COMMIT_SHORT}"
}
}
}
stage('Parallel Analysis') {
parallel {
stage('Code Quality') {
steps {
script {
// SonarQube analysis
def sonarScannerHome = tool 'SonarQubeScanner'
withSonarQubeEnv('SonarQube') {
sh """
${sonarScannerHome}/bin/sonar-scanner \
-Dsonar.projectKey=${APP_NAME} \
-Dsonar.projectVersion=${BUILD_VERSION} \
-Dsonar.sources=src \
-Dsonar.tests=tests \
-Dsonar.typescript.lcov.reportPaths=coverage/lcov.info
"""
}
}
}
}
stage('Security Scan') {
steps {
script {
// OWASP Dependency Check
sh 'npm audit --audit-level moderate'
// Snyk security scanning
sh """
npx snyk test \
--severity-threshold=high \
--json > snyk-results.json || true
"""
// Trivy filesystem scan
sh """
trivy fs . \
--format json \
--output trivy-results.json \
--severity HIGH,CRITICAL
"""
}
}
post {
always {
publishHTML([
allowMissing: false,
alwaysLinkToLastBuild: true,
keepAll: true,
reportDir: '.',
reportFiles: 'snyk-results.json',
reportName: 'Snyk Security Report'
])
}
}
}
stage('License Compliance') {
steps {
script {
// License scanning
sh """
npx license-checker \
--json \
--out license-report.json \
--excludePackages '[email protected]'
"""
// FOSSA analysis for enterprise
sh """
fossa analyze \
--project ${APP_NAME} \
--revision ${GIT_COMMIT_SHORT}
"""
}
}
}
}
}
stage('Testing Suite') {
parallel {
stage('Unit Tests') {
steps {
sh 'npm run test:unit -- --coverage --ci'
}
post {
always {
junit testResults: 'test-results/unit/junit.xml'
publishCoverage adapters: [
coberturaAdapter('coverage/cobertura-coverage.xml')
]
}
}
}
stage('Integration Tests') {
steps {
script {
// Start test services
sh '''
docker-compose -f docker-compose.test.yml up -d
sleep 30 # Wait for services to be ready
'''
try {
sh 'npm run test:integration'
} finally {
// Cleanup
sh 'docker-compose -f docker-compose.test.yml down'
}
}
}
}
stage('Performance Tests') {
steps {
script {
// JMeter performance testing
sh """
jmeter -n -t performance-tests/load-test.jmx \
-l performance-results.jtl \
-j jmeter.log \
-Jthreads=50 \
-Jrampup=60 \
-Jduration=300
"""
}
}
post {
always {
perfReport sourceDataFiles: 'performance-results.jtl'
}
}
}
}
}
stage('Build & Package') {
when {
anyOf {
branch 'main'
branch 'develop'
changeRequest()
}
}
steps {
script {
// Multi-platform Docker build
sh """
docker buildx build \
--platform linux/amd64,linux/arm64 \
--build-arg BUILD_VERSION=${BUILD_VERSION} \
--build-arg GIT_COMMIT=${GIT_COMMIT_SHORT} \
--tag ${DOCKER_REGISTRY}/${APP_NAME}:${BUILD_VERSION} \
--tag ${DOCKER_REGISTRY}/${APP_NAME}:latest \
--push .
"""
// Security scan of built image
sh """
trivy image \
--format json \
--output image-scan-results.json \
${DOCKER_REGISTRY}/${APP_NAME}:${BUILD_VERSION}
"""
}
}
}
stage('Deploy to Staging') {
when {
anyOf {
branch 'main'
branch 'develop'
}
}
steps {
script {
// Deploy using Helm
sh """
helm upgrade --install ${APP_NAME}-staging \
./helm-chart \
--namespace staging \
--set image.tag=${BUILD_VERSION} \
--set environment=staging \
--wait --timeout=5m
"""
// Run smoke tests
sh 'npm run test:smoke -- --env=staging'
}
}
}
stage('Production Deployment Approval') {
when {
branch 'main'
}
steps {
script {
// Quality gate check
timeout(time: 5, unit: 'MINUTES') {
waitForQualityGate abortPipeline: true
}
// Manual approval for production
input message: 'Deploy to production?',
submitter: 'admin,devops-team'
}
}
}
stage('Production Deployment') {
when {
branch 'main'
}
steps {
script {
// Blue-Green deployment
sh """
helm upgrade --install ${APP_NAME}-production \
./helm-chart \
--namespace production \
--set image.tag=${BUILD_VERSION} \
--set environment=production \
--set replicaCount=5 \
--wait --timeout=10m
"""
// Verify deployment
sh 'npm run test:production-health'
}
}
}
}
post {
always {
// Archive artifacts
archiveArtifacts artifacts: '**/*.json,**/*.xml,**/*.log',
allowEmptyArchive: true
// Clean workspace
cleanWs()
}
failure {
// Send Slack notification on failure
slackSend channel: '#devops-alerts',
color: 'danger',
message: """
🚨 Pipeline Failed: ${env.JOB_NAME} - ${env.BUILD_NUMBER}
Branch: ${env.BRANCH_NAME}
Commit: ${env.GIT_COMMIT_SHORT}
Duration: ${currentBuild.durationString}
View logs: ${env.BUILD_URL}
"""
}
success {
slackSend channel: '#deployments',
color: 'good',
message: """
✅ Deployment Successful: ${env.JOB_NAME} - ${env.BUILD_NUMBER}
Environment: Production
Version: ${BUILD_VERSION}
Duration: ${currentBuild.durationString}
"""
}
}
}
Continuous Deployment (CD) Strategies
Deployment Patterns
Blue-Green Deployment Implementation
import kubernetes
import time
import requests
from typing import Dict, List
import logging
class BlueGreenDeployment:
def __init__(self, namespace: str, app_name: str, kube_config_path: str):
self.namespace = namespace
self.app_name = app_name
self.kube_client = self.setup_kubernetes_client(kube_config_path)
self.apps_v1 = kubernetes.client.AppsV1Api()
self.core_v1 = kubernetes.client.CoreV1Api()
self.logger = logging.getLogger(__name__)
def setup_kubernetes_client(self, config_path: str):
"""Initialize Kubernetes client"""
kubernetes.config.load_kube_config(config_file=config_path)
return kubernetes.client.ApiClient()
def deploy(self, new_image: str, health_check_url: str, rollback_on_failure: bool = True) -> Dict:
"""
Execute blue-green deployment
"""
deployment_result = {
'status': 'started',
'blue_version': None,
'green_version': None,
'traffic_switched': False,
'rollback_performed': False
}
try:
# Step 1: Identify current (blue) and new (green) versions
current_deployment = self.get_current_deployment()
if current_deployment:
blue_name = current_deployment.metadata.name
green_name = f"{self.app_name}-green" if "green" not in blue_name else f"{self.app_name}-blue"
else:
blue_name = f"{self.app_name}-blue"
green_name = f"{self.app_name}-green"
deployment_result['blue_version'] = blue_name
deployment_result['green_version'] = green_name
self.logger.info(f"Starting blue-green deployment: {blue_name} -> {green_name}")
# Step 2: Deploy green version
green_deployment = self.create_green_deployment(green_name, new_image)
# Step 3: Wait for green deployment to be ready
if not self.wait_for_deployment_ready(green_name, timeout=600):
raise Exception(f"Green deployment {green_name} failed to become ready")
# Step 4: Perform health checks on green version
green_service_url = self.create_temporary_service(green_name)
if not self.perform_health_checks(green_service_url + health_check_url):
raise Exception("Green deployment failed health checks")
# Step 5: Switch traffic from blue to green
self.switch_traffic_to_green(green_name)
deployment_result['traffic_switched'] = True
# Step 6: Verify production traffic health
if not self.verify_production_health(health_check_url):
if rollback_on_failure:
self.rollback_to_blue(blue_name)
deployment_result['rollback_performed'] = True
raise Exception("Production health check failed, rolled back to blue")
else:
raise Exception("Production health check failed")
# Step 7: Cleanup old blue deployment
self.cleanup_old_deployment(blue_name)
deployment_result['status'] = 'completed'
self.logger.info("Blue-green deployment completed successfully")
return deployment_result
except Exception as e:
self.logger.error(f"Blue-green deployment failed: {str(e)}")
deployment_result['status'] = 'failed'
deployment_result['error'] = str(e)
if rollback_on_failure and deployment_result['traffic_switched']:
try:
self.rollback_to_blue(blue_name)
deployment_result['rollback_performed'] = True
except Exception as rollback_error:
self.logger.error(f"Rollback failed: {str(rollback_error)}")
deployment_result['rollback_error'] = str(rollback_error)
raise e
def create_green_deployment(self, green_name: str, new_image: str) -> kubernetes.client.V1Deployment:
"""Create green deployment with new image"""
# Get current deployment specification as template
current_deployment = self.get_current_deployment()
if current_deployment:
green_spec = current_deployment.spec
green_metadata = current_deployment.metadata
else:
# Create default deployment spec
green_spec = self.create_default_deployment_spec()
green_metadata = kubernetes.client.V1ObjectMeta(name=green_name)
# Update metadata
green_metadata.name = green_name
green_metadata.labels = green_metadata.labels or {}
green_metadata.labels['version'] = 'green'
green_metadata.labels['deployment-strategy'] = 'blue-green'
# Update container image
green_spec.template.spec.containers[0].image = new_image
# Update selector and template labels
green_spec.selector.match_labels['version'] = 'green'
green_spec.template.metadata.labels = green_spec.template.metadata.labels or {}
green_spec.template.metadata.labels['version'] = 'green'
# Create deployment
green_deployment = kubernetes.client.V1Deployment(
metadata=green_metadata,
spec=green_spec
)
try:
# Delete existing green deployment if it exists
self.apps_v1.delete_namespaced_deployment(
name=green_name,
namespace=self.namespace
)
time.sleep(5) # Wait for cleanup
except kubernetes.client.exceptions.ApiException:
pass # Deployment doesn't exist, which is fine
# Create new green deployment
created_deployment = self.apps_v1.create_namespaced_deployment(
namespace=self.namespace,
body=green_deployment
)
self.logger.info(f"Created green deployment: {green_name}")
return created_deployment
def wait_for_deployment_ready(self, deployment_name: str, timeout: int = 600) -> bool:
"""Wait for deployment to be ready"""
start_time = time.time()
while time.time() - start_time < timeout:
try:
deployment = self.apps_v1.read_namespaced_deployment(
name=deployment_name,
namespace=self.namespace
)
# Check if deployment is ready
if (deployment.status.ready_replicas and
deployment.status.ready_replicas == deployment.spec.replicas):
self.logger.info(f"Deployment {deployment_name} is ready")
return True
self.logger.info(f"Waiting for deployment {deployment_name} to be ready...")
time.sleep(10)
except kubernetes.client.exceptions.ApiException as e:
self.logger.error(f"Error checking deployment status: {e}")
time.sleep(10)
self.logger.error(f"Deployment {deployment_name} failed to become ready within {timeout} seconds")
return False
def perform_health_checks(self, health_url: str, max_attempts: int = 10) -> bool:
"""Perform health checks on the green deployment"""
for attempt in range(max_attempts):
try:
response = requests.get(health_url, timeout=10)
if response.status_code == 200:
health_data = response.json()
if health_data.get('status') == 'healthy':
self.logger.info(f"Health check passed on attempt {attempt + 1}")
return True
else:
self.logger.warning(f"Health check returned unhealthy status: {health_data}")
except Exception as e:
self.logger.warning(f"Health check attempt {attempt + 1} failed: {e}")
if attempt < max_attempts - 1:
time.sleep(30) # Wait 30 seconds between attempts
self.logger.error(f"All {max_attempts} health check attempts failed")
return False
def switch_traffic_to_green(self, green_name: str):
"""Switch service traffic to green deployment"""
service_name = f"{self.app_name}-service"
try:
# Get current service
service = self.core_v1.read_namespaced_service(
name=service_name,
namespace=self.namespace
)
# Update selector to point to green deployment
service.spec.selector['version'] = 'green'
# Update service
self.core_v1.patch_namespaced_service(
name=service_name,
namespace=self.namespace,
body=service
)
self.logger.info(f"Switched traffic to green deployment: {green_name}")
except kubernetes.client.exceptions.ApiException as e:
self.logger.error(f"Failed to switch traffic: {e}")
raise e
def rollback_to_blue(self, blue_name: str):
"""Rollback traffic to blue deployment"""
service_name = f"{self.app_name}-service"
try:
service = self.core_v1.read_namespaced_service(
name=service_name,
namespace=self.namespace
)
service.spec.selector['version'] = 'blue'
self.core_v1.patch_namespaced_service(
name=service_name,
namespace=self.namespace,
body=service
)
self.logger.info(f"Rolled back traffic to blue deployment: {blue_name}")
except kubernetes.client.exceptions.ApiException as e:
self.logger.error(f"Failed to rollback: {e}")
raise e
# Canary Deployment Implementation
class CanaryDeployment:
def __init__(self, namespace: str, app_name: str):
self.namespace = namespace
self.app_name = app_name
self.istio_client = self.setup_istio_client()
self.logger = logging.getLogger(__name__)
def deploy_canary(self, new_image: str, canary_percentage: int = 10,
success_threshold: float = 0.95, duration_minutes: int = 30) -> Dict:
"""
Execute canary deployment with traffic splitting
"""
deployment_result = {
'status': 'started',
'canary_percentage': canary_percentage,
'success_rate': 0.0,
'error_rate': 0.0,
'promoted': False
}
try:
# Step 1: Deploy canary version
canary_name = f"{self.app_name}-canary"
self.create_canary_deployment(canary_name, new_image)
# Step 2: Configure traffic splitting
self.configure_traffic_split(canary_percentage)
# Step 3: Monitor canary metrics
monitoring_duration = duration_minutes * 60 # Convert to seconds
metrics = self.monitor_canary_metrics(monitoring_duration)
deployment_result['success_rate'] = metrics['success_rate']
deployment_result['error_rate'] = metrics['error_rate']
# Step 4: Decide promotion or rollback
if metrics['success_rate'] >= success_threshold and metrics['error_rate'] < 0.05:
# Promote canary to production
self.promote_canary()
deployment_result['promoted'] = True
deployment_result['status'] = 'promoted'
self.logger.info("Canary deployment promoted to production")
else:
# Rollback canary
self.rollback_canary()
deployment_result['status'] = 'rolled_back'
self.logger.warning("Canary deployment rolled back due to poor metrics")
return deployment_result
except Exception as e:
self.logger.error(f"Canary deployment failed: {str(e)}")
deployment_result['status'] = 'failed'
deployment_result['error'] = str(e)
# Attempt rollback
try:
self.rollback_canary()
except Exception as rollback_error:
self.logger.error(f"Canary rollback failed: {str(rollback_error)}")
raise e
def monitor_canary_metrics(self, duration_seconds: int) -> Dict:
"""Monitor canary deployment metrics"""
start_time = time.time()
success_count = 0
error_count = 0
total_requests = 0
while time.time() - start_time < duration_seconds:
try:
# Query Prometheus for metrics
canary_metrics = self.query_prometheus_metrics()
success_count += canary_metrics.get('success_count', 0)
error_count += canary_metrics.get('error_count', 0)
total_requests += canary_metrics.get('total_requests', 0)
# Log current metrics
if total_requests > 0:
current_success_rate = success_count / total_requests
current_error_rate = error_count / total_requests
self.logger.info(f"Canary metrics - Success: {current_success_rate:.2%}, "
f"Error: {current_error_rate:.2%}, Total: {total_requests}")
time.sleep(60) # Check every minute
except Exception as e:
self.logger.warning(f"Error collecting metrics: {e}")
time.sleep(60)
# Calculate final metrics
final_success_rate = success_count / total_requests if total_requests > 0 else 0
final_error_rate = error_count / total_requests if total_requests > 0 else 0
return {
'success_rate': final_success_rate,
'error_rate': final_error_rate,
'total_requests': total_requests,
'success_count': success_count,
'error_count': error_count
}
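The promotion decision inside `deploy_canary` reduces to a pure threshold check, and isolating it makes the rule unit-testable without a cluster. A minimal sketch using the same thresholds as the class above (success rate at or above the threshold, error rate below the 5% cap):

```python
def should_promote(success_rate: float, error_rate: float,
                   success_threshold: float = 0.95,
                   max_error_rate: float = 0.05) -> bool:
    """Mirror of the canary promotion rule: promote only when the observed
    success rate clears the threshold AND the error rate stays under the cap."""
    return success_rate >= success_threshold and error_rate < max_error_rate
```

`deploy_canary` could then call `should_promote(metrics['success_rate'], metrics['error_rate'], success_threshold)` instead of inlining the comparison, keeping the deployment mechanics and the promotion policy separately testable.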
Infrastructure as Code (IaC)
Terraform Enterprise Infrastructure
Multi-Environment Infrastructure Management
# main.tf - Root module for enterprise infrastructure
terraform {
required_version = ">= 1.5"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.20"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.10"
}
}
backend "s3" {
bucket = "company-terraform-state"
key = "infrastructure/terraform.tfstate"
region = "us-west-2"
encrypt = true
dynamodb_table = "terraform-state-locks"
}
}
# Variables
variable "environment" {
description = "Environment name (dev, staging, production)"
type = string
validation {
condition = contains(["dev", "staging", "production"], var.environment)
error_message = "Environment must be dev, staging, or production."
}
}
variable "region" {
description = "AWS region"
type = string
default = "us-west-2"
}
variable "app_name" {
description = "Application name"
type = string
}
variable "team" {
description = "Team responsible for the infrastructure"
type = string
}
# Local values
locals {
common_tags = {
Environment = var.environment
Application = var.app_name
Team = var.team
ManagedBy = "Terraform"
CostCenter = "Engineering"
Backup = var.environment == "production" ? "required" : "optional"
}
# Environment-specific configurations
env_config = {
dev = {
instance_types = ["t3.medium"]
min_capacity = 1
max_capacity = 3
desired_capacity = 2
db_instance_class = "db.t3.micro"
backup_retention = 7
}
staging = {
instance_types = ["t3.large"]
min_capacity = 2
max_capacity = 5
desired_capacity = 3
db_instance_class = "db.t3.small"
backup_retention = 14
}
production = {
instance_types = ["m5.large", "m5.xlarge"]
min_capacity = 3
max_capacity = 20
desired_capacity = 5
db_instance_class = "db.r5.large"
backup_retention = 30
}
}
}
# Data sources
data "aws_availability_zones" "available" {
state = "available"
}
data "aws_caller_identity" "current" {}
# VPC Module
module "vpc" {
source = "./modules/vpc"
name = "${var.app_name}-${var.environment}"
cidr = "10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.0.0/16"
azs = data.aws_availability_zones.available.names
private_subnets = [
"10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.1.0/24",
"10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.2.0/24",
"10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.3.0/24"
]
public_subnets = [
"10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.101.0/24",
"10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.102.0/24",
"10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.103.0/24"
]
database_subnets = [
"10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.201.0/24",
"10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.202.0/24",
"10.${var.environment == "production" ? 0 : (var.environment == "staging" ? 1 : 2)}.203.0/24"
]
enable_nat_gateway = true
enable_vpn_gateway = var.environment == "production"
enable_dns_hostnames = true
enable_dns_support = true
# Flow logs
enable_flow_log = true
flow_log_destination_type = "cloud-watch-logs"
tags = local.common_tags
}
# EKS Module
module "eks" {
source = "./modules/eks"
cluster_name = "${var.app_name}-${var.environment}"
cluster_version = "1.27"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
# Node groups
node_groups = {
general = {
name = "general"
instance_types = local.env_config[var.environment].instance_types
min_capacity = local.env_config[var.environment].min_capacity
max_capacity = local.env_config[var.environment].max_capacity
desired_capacity = local.env_config[var.environment].desired_capacity
k8s_labels = {
Environment = var.environment
NodeGroup = "general"
}
additional_tags = local.common_tags
}
}
# Add-ons
cluster_addons = {
coredns = {
resolve_conflicts = "OVERWRITE"
}
kube-proxy = {}
vpc-cni = {
resolve_conflicts = "OVERWRITE"
}
aws-ebs-csi-driver = {
resolve_conflicts = "OVERWRITE"
}
}
# RBAC
manage_aws_auth_configmap = true
aws_auth_roles = [
{
rolearn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/DevOpsRole"
username = "devops"
groups = ["system:masters"]
}
]
tags = local.common_tags
}
# RDS Module
module "rds" {
source = "./modules/rds"
identifier = "${var.app_name}-${var.environment}"
engine = "postgres"
engine_version = "14.8"
instance_class = local.env_config[var.environment].db_instance_class
allocated_storage = var.environment == "production" ? 100 : 20
max_allocated_storage = var.environment == "production" ? 1000 : 100
storage_encrypted = true
db_name = "${replace(var.app_name, "-", "_")}_${var.environment}"
username = "app_user"
vpc_security_group_ids = [module.vpc.database_security_group_id]
db_subnet_group_name = module.vpc.database_subnet_group
backup_retention_period = local.env_config[var.environment].backup_retention
backup_window = "03:00-04:00"
maintenance_window = "sun:04:00-sun:05:00"
deletion_protection = var.environment == "production"
skip_final_snapshot = var.environment != "production"
performance_insights_enabled = var.environment == "production"
monitoring_interval = var.environment == "production" ? 60 : 0
tags = local.common_tags
}
# Redis Module
module "redis" {
source = "./modules/redis"
cluster_id = "${var.app_name}-${var.environment}"
node_type = var.environment == "production" ? "cache.r6g.large" : "cache.t3.micro"
num_cache_nodes = var.environment == "production" ? 3 : 1
parameter_group = "default.redis7"
port = 6379
subnet_group_name = module.vpc.elasticache_subnet_group_name
security_group_ids = [module.vpc.elasticache_security_group_id]
at_rest_encryption_enabled = true
transit_encryption_enabled = true
maintenance_window = "sun:05:00-sun:06:00"
snapshot_window = "03:00-05:00"
snapshot_retention_limit = var.environment == "production" ? 7 : 3
tags = local.common_tags
}
# Monitoring Module
module "monitoring" {
source = "./modules/monitoring"
cluster_name = module.eks.cluster_id
environment = var.environment
app_name = var.app_name
# Prometheus configuration
prometheus_namespace = "monitoring"
grafana_namespace = "monitoring"
# Alert manager configuration
alert_manager_config = {
smtp_host = "smtp.company.com"
smtp_port = 587
smtp_username = "[email protected]"
webhook_url = "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
}
# Grafana configuration
grafana_admin_password = var.grafana_admin_password
grafana_ingress_host = "grafana-${var.environment}.company.com"
tags = local.common_tags
}
# Output values
output "vpc_id" {
description = "VPC ID"
value = module.vpc.vpc_id
}
output "eks_cluster_endpoint" {
description = "EKS cluster endpoint"
value = module.eks.cluster_endpoint
sensitive = true
}
output "eks_cluster_name" {
description = "EKS cluster name"
value = module.eks.cluster_id
}
output "rds_endpoint" {
description = "RDS endpoint"
value = module.rds.db_instance_endpoint
sensitive = true
}
output "redis_endpoint" {
description = "Redis endpoint"
value = module.redis.cache_nodes
sensitive = true
}
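The nested ternaries in the VPC module encode a simple numbering scheme: one /16 per environment (production gets 10.0.0.0/16, staging 10.1.0.0/16, dev 10.2.0.0/16) with fixed third octets per subnet tier. The same logic in plain Python (a hypothetical helper, shown only to make the scheme explicit) reads much more directly:

```python
def env_cidrs(environment: str) -> dict:
    """Reproduce the Terraform CIDR scheme: one /16 per environment, with /24
    subnets at fixed third octets for the private, public, and database tiers."""
    second_octet = {"production": 0, "staging": 1, "dev": 2}[environment]
    base = f"10.{second_octet}"
    return {
        "vpc": f"{base}.0.0/16",
        "private": [f"{base}.{i}.0/24" for i in (1, 2, 3)],
        "public": [f"{base}.{i}.0/24" for i in (101, 102, 103)],
        "database": [f"{base}.{i}.0/24" for i in (201, 202, 203)],
    }
```

In Terraform itself, a `lookup` on a locals map keyed by environment (like the existing `env_config`) would express this more cleanly than repeating the ternary in every subnet list.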
Configuration Management with Ansible
Automated Server Configuration
# playbooks/site.yml - Main playbook for server configuration
---
- name: Configure Development Environment
hosts: development
become: yes
vars:
environment: development
app_name: microservice-app
deploy_user: deploy
docker_users:
- "{{ deploy_user }}"
- jenkins
pre_tasks:
- name: Update system packages
package:
name: "*"
state: latest
when: ansible_os_family == "RedHat"
- name: Update apt cache
apt:
update_cache: yes
cache_valid_time: 3600
when: ansible_os_family == "Debian"
roles:
- common
- docker
- kubernetes
- monitoring
- security
- name: Configure Production Environment
hosts: production
become: yes
vars:
environment: production
app_name: microservice-app
deploy_user: deploy
security_hardening: true
pre_tasks:
- name: Verify production deployment authorization
pause:
prompt: "Are you authorized to deploy to production? (yes/no)"
register: production_auth
- name: Fail if not authorized
fail:
msg: "Production deployment not authorized"
when: production_auth.user_input != "yes"
roles:
- common
- docker
- kubernetes
- monitoring
- security
- backup
- compliance
# roles/docker/tasks/main.yml
---
- name: Install Docker dependencies
package:
name:
- apt-transport-https
- ca-certificates
- curl
- gnupg
- lsb-release
state: present
when: ansible_os_family == "Debian"
- name: Add Docker's official GPG key
apt_key:
url: https://download.docker.com/linux/ubuntu/gpg
state: present
when: ansible_os_family == "Debian"
- name: Add Docker repository
apt_repository:
repo: deb https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable
state: present
when: ansible_os_family == "Debian"
- name: Install Docker Engine
package:
name:
- docker-ce
- docker-ce-cli
- containerd.io
- docker-compose-plugin
state: present
- name: Start and enable Docker service
systemd:
name: docker
state: started
enabled: yes
- name: Add users to docker group
user:
name: "{{ item }}"
groups: docker
append: yes
loop: "{{ docker_users }}"
- name: Configure Docker daemon
copy:
content: |
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
},
"storage-driver": "overlay2",
"live-restore": true,
"userland-proxy": false,
"experimental": false,
"metrics-addr": "127.0.0.1:9323",
"insecure-registries": []
}
dest: /etc/docker/daemon.json
owner: root
group: root
mode: '0644'
notify: restart docker
- name: Verify Docker Compose plugin
command: docker compose version
register: compose_version
changed_when: false
- name: Verify Docker installation
command: docker --version
register: docker_version
- name: Display Docker version
debug:
msg: "Docker installed: {{ docker_version.stdout }}"
# roles/kubernetes/tasks/main.yml
---
# The legacy apt.kubernetes.io repository is deprecated; pkgs.k8s.io is its replacement
- name: Add Kubernetes APT repository key
apt_key:
url: https://pkgs.k8s.io/core:/stable:/v1.27/deb/Release.key
state: present
when: ansible_os_family == "Debian"
- name: Add Kubernetes APT repository
apt_repository:
repo: deb https://pkgs.k8s.io/core:/stable:/v1.27/deb/ /
state: present
when: ansible_os_family == "Debian"
- name: Install Kubernetes tools
package:
name:
- kubectl
- kubeadm
- kubelet
state: present
- name: Hold Kubernetes packages
dpkg_selections:
name: "{{ item }}"
selection: hold
loop:
- kubectl
- kubeadm
- kubelet
when: ansible_os_family == "Debian"
- name: Install Helm
get_url:
url: https://get.helm.sh/helm-v3.12.0-linux-amd64.tar.gz
dest: /tmp/helm.tar.gz
- name: Extract Helm
unarchive:
src: /tmp/helm.tar.gz
dest: /tmp
remote_src: yes
- name: Install Helm binary
copy:
src: /tmp/linux-amd64/helm
dest: /usr/local/bin/helm
mode: '0755'
remote_src: yes
- name: Create kubeconfig directory
file:
path: /home/{{ deploy_user }}/.kube
state: directory
owner: "{{ deploy_user }}"
group: "{{ deploy_user }}"
mode: '0755'
- name: Install kubectl bash completion
shell: kubectl completion bash > /etc/bash_completion.d/kubectl
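The role above installs the tooling but does not bootstrap a cluster. A hedged sketch of the follow-on tasks for a control-plane node, which also populates the kubeconfig directory the role creates (the pod network CIDR and the use of `kubeadm init` here are assumptions, not part of the original role):

# Hypothetical control-plane bootstrap tasks
- name: Initialize the control plane
  command: kubeadm init --pod-network-cidr=10.244.0.0/16
  args:
    creates: /etc/kubernetes/admin.conf

- name: Copy admin kubeconfig for the deploy user
  copy:
    src: /etc/kubernetes/admin.conf
    dest: /home/{{ deploy_user }}/.kube/config
    remote_src: yes
    owner: "{{ deploy_user }}"
    group: "{{ deploy_user }}"
    mode: '0600'

The `creates` guard keeps the task idempotent: `kubeadm init` runs only when no admin kubeconfig exists yet.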
# roles/monitoring/tasks/main.yml
---
- name: Create monitoring user
  user:
    name: monitoring
    system: yes
    shell: /bin/false
    home: /var/lib/monitoring
    createhome: yes

- name: Install Node Exporter
  get_url:
    url: https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
    dest: /tmp/node_exporter.tar.gz

- name: Extract Node Exporter
  unarchive:
    src: /tmp/node_exporter.tar.gz
    dest: /tmp
    remote_src: yes

- name: Install Node Exporter binary
  copy:
    src: /tmp/node_exporter-1.6.0.linux-amd64/node_exporter
    dest: /usr/local/bin/node_exporter
    mode: '0755'
    owner: monitoring
    group: monitoring
    remote_src: yes

- name: Create Node Exporter systemd service
  copy:
    content: |
      [Unit]
      Description=Node Exporter
      After=network.target

      [Service]
      User=monitoring
      Group=monitoring
      Type=simple
      ExecStart=/usr/local/bin/node_exporter \
        --web.listen-address=:9100 \
        --collector.systemd \
        --collector.processes
      Restart=always

      [Install]
      WantedBy=multi-user.target
    dest: /etc/systemd/system/node_exporter.service

- name: Start and enable Node Exporter
  systemd:
    name: node_exporter
    state: started
    enabled: yes
    daemon_reload: yes

- name: Install Filebeat for log shipping
  get_url:
    url: https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.8.0-linux-x86_64.tar.gz
    dest: /tmp/filebeat.tar.gz

- name: Extract Filebeat
  unarchive:
    src: /tmp/filebeat.tar.gz
    dest: /opt
    remote_src: yes

- name: Create Filebeat symlink
  file:
    src: /opt/filebeat-8.8.0-linux-x86_64
    dest: /opt/filebeat
    state: link

- name: Configure Filebeat
  template:
    src: filebeat.yml.j2
    dest: /opt/filebeat/filebeat.yml
    owner: root
    group: root
    mode: '0600'
  notify: restart filebeat

- name: Create Filebeat systemd service
  copy:
    content: |
      [Unit]
      Description=Filebeat
      After=network.target

      [Service]
      Type=simple
      User=root
      Group=root
      ExecStart=/opt/filebeat/filebeat -c /opt/filebeat/filebeat.yml
      Restart=always

      [Install]
      WantedBy=multi-user.target
    dest: /etc/systemd/system/filebeat.service

- name: Start and enable Filebeat
  systemd:
    name: filebeat
    state: started
    enabled: yes
    daemon_reload: yes
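The monitoring role renders a `filebeat.yml.j2` template that is not shown. A hedged sketch of what such a template might contain, assuming logs ship to a Logstash endpoint (the `logstash_host` variable and the input paths are illustrative assumptions):

# templates/filebeat.yml.j2 (hypothetical sketch)
filebeat.inputs:
  - type: filestream
    id: system-logs
    paths:
      - /var/log/*.log

output.logstash:
  hosts: ["{{ logstash_host }}:5044"]

Swapping the output for `output.elasticsearch` with an `hosts` list would ship directly to Elasticsearch instead; only one output may be enabled at a time.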
# handlers/main.yml
---
- name: restart docker
  systemd:
    name: docker
    state: restarted

- name: restart filebeat
  systemd:
    name: filebeat
    state: restarted
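The roles and handlers above come together in a top-level playbook. A minimal sketch, in which the host group, variable values, and role layout are illustrative assumptions:

# site.yml (hypothetical)
---
- name: Provision DevOps infrastructure
  hosts: all
  become: yes
  vars:
    deploy_user: deploy
    docker_users:
      - deploy
  roles:
    - docker
    - kubernetes
    - monitoring

Running `ansible-playbook -i inventory/production site.yml` would then apply all three roles; because each task is idempotent, re-running the playbook converges the hosts without repeating completed work.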
Working with Innoworks for DevOps Implementation
At Innoworks, we understand that successful DevOps implementation requires more than tools: it demands cultural transformation, process optimization, and continuous improvement. Our comprehensive approach to DevOps helps organizations accelerate software delivery while maintaining high standards of quality, security, and reliability.
Our DevOps Expertise
End-to-End Pipeline Design: We design and implement comprehensive CI/CD pipelines that automate the entire software delivery process from code commit to production deployment, reducing time-to-market and improving quality.
Infrastructure as Code Mastery: Our team specializes in IaC practices using Terraform, Ansible, and cloud-native tools to create repeatable, version-controlled infrastructure that scales with your business needs.
Cloud-Native DevOps: We implement DevOps practices optimized for cloud platforms including AWS, Azure, and GCP, leveraging managed services and cloud-native tools for maximum efficiency.
Rapid Implementation: Utilizing our proven 8-week development cycles, we help organizations quickly establish DevOps practices and see immediate improvements in deployment frequency and reliability.
Comprehensive DevOps Services
- DevOps Strategy and Assessment
- CI/CD Pipeline Design and Implementation
- Infrastructure as Code (IaC) Development
- Container Orchestration with Kubernetes
- Monitoring and Observability Solutions
- Security Integration (DevSecOps)
- Cloud Migration and Optimization
- Team Training and Cultural Transformation
Get Started with DevOps Implementation
Ready to transform your software delivery process with modern DevOps practices? Contact our DevOps experts to discuss your requirements and learn how we can help you implement CI/CD pipelines, infrastructure automation, and monitoring solutions that accelerate your development cycles while improving quality and reliability.
Accelerate software delivery with proven DevOps practices. Partner with Innoworks to implement CI/CD pipelines, infrastructure automation, and monitoring solutions that enable rapid, reliable software delivery at scale.