| Type | Description | Tested K8s Platform |
|---|---|---|
| Generic | External disk loss from the node | GKE, AWS (KOPS) |
- Ensure that the Litmus Chaos Operator is running by executing kubectl get pods in the operator namespace (typically, litmus). If not, install from here
- Ensure that the disk-loss experiment resource is available in the cluster by executing kubectl get chaosexperiments in the desired namespace. If not, install from here
- Create a Kubernetes secret containing the gcloud/AWS access configuration (key) in the namespace of
- Administrative access to the platform on which the cluster is hosted is required, as recovery of the affected node could be manual (for example, gcloud access to the project).
    apiVersion: v1
    kind: Secret
    metadata:
      name: cloud-secret
    type: Opaque
    stringData:
      cloud_config.yml: |-
        # Add the cloud AWS credentials or GCP service account respectively
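As a minimal sketch of how the secret above might be filled in, the following shell helper embeds a local credentials file under the cloud_config.yml key and prints the resulting manifest. The function name render_cloud_secret, its namespace argument, and the file path are hypothetical and exist only for illustration; the target namespace is not specified here, so it is passed in explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Litmus): prints a cloud-secret manifest
# with the contents of a local credentials file under cloud_config.yml.
render_cloud_secret() {
  local namespace="$1"    # assumption: namespace where the secret should live
  local creds_file="$2"   # assumption: AWS credentials or GCP service account file
  if [ ! -r "$creds_file" ]; then
    echo "cannot read credentials file: $creds_file" >&2
    return 1
  fi
  # Indent every line of the file by four spaces so that it sits inside
  # the YAML block scalar that follows "cloud_config.yml: |-".
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
  namespace: ${namespace}
type: Opaque
stringData:
  cloud_config.yml: |-
$(sed 's/^/    /' "$creds_file")
EOF
}
```

The output could then be piped to the cluster, for example with render_cloud_secret litmus ./cloud_config.yml | kubectl apply -f -, where litmus and the file name are placeholders.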
- The disk is healthy before chaos injection
- The disk is healthy post chaos injection
- If APP_CHECK is true, the application pod health is checked post chaos injection
- In this experiment, the external disk is detached from the node for a period equal to the TOTAL_CHAOS_DURATION.
- This chaos experiment is supported on GKE and AWS platforms.
- If the disk was created as part of a dynamic persistent volume, it is expected to re-attach automatically. The experiment re-attaches the disk if it is not already attached.
Note: For a mounted disk in particular, remounting the disk is a manual step that the user has to perform.
- Disk loss is effected using the litmus chaoslib that internally makes use of the aws/gcloud commands
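The exact invocations used by the chaoslib are not listed here. As a rough illustration only, the sketch below prints the gcloud commands that a detach, wait, re-attach cycle could correspond to on GKE. The function name print_gke_disk_loss_plan is hypothetical, the values are taken from the sample manifest, and the commands are printed rather than executed. Unlike the experiment, which re-attaches only when the disk is not already attached, this sketch lists the attach step unconditionally.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints (does not run) a detach/wait/attach plan.
# Assumption: the chaoslib may use different flags or additional checks.
print_gke_disk_loss_plan() {
  local project="$1" node="$2" disk="$3" zone="$4"
  local duration="${5:-15}"   # seconds; 15 mirrors the documented default
  echo "gcloud compute instances detach-disk ${node} --disk=${disk} --zone=${zone} --project=${project}"
  echo "sleep ${duration}"
  echo "gcloud compute instances attach-disk ${node} --disk=${disk} --zone=${zone} --project=${project}"
}

# Example values taken from the sample ChaosEngine manifest.
print_gke_disk_loss_plan litmus-demo-123 demo-node-123 demo-disk-123 us-central1-a 60
```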
Steps to Execute the Chaos Experiment
This Chaos Experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer to Getting Started
Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine & execute the experiment.
- Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.
Sample RBAC Manifest

    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: disk-loss-sa
      namespace: default
      labels:
        name: disk-loss-sa
    ---
    # rbac.authorization.k8s.io/v1beta1 was removed in Kubernetes 1.22; v1 is used here
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: disk-loss-sa
      labels:
        name: disk-loss-sa
    rules:
    - apiGroups: ["","litmuschaos.io","batch"]
      resources: ["pods","jobs","secrets","events","pods/log","chaosengines","chaosexperiments","chaosresults"]
      verbs: ["create","list","get","patch","update","delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: disk-loss-sa
      labels:
        name: disk-loss-sa
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: disk-loss-sa
    subjects:
    - kind: ServiceAccount
      name: disk-loss-sa
      namespace: default
- Provide the application info in spec.appinfo
- Provide the auxiliary applications info (ns & labels) in spec.auxiliaryAppInfo
- Override the experiment tunables, if desired, in the env list under spec.experiments (components.env)
- To understand the values to provide in a ChaosEngine specification, refer to ChaosEngine Concepts
Supported Experiment Tunables for application
| Parameter | Description | Specify In ChaosEngine | Notes |
|---|---|---|---|
| CLOUD_PLATFORM | Cloud platform name | Mandatory | Supported platforms: GKE, AWS |
| PROJECT_ID | GCP project ID; leave blank for AWS | Mandatory | |
| NODE_NAME | Node name of the cluster | Mandatory | |
| DISK_NAME | Disk name of the node; it must be an external disk | Mandatory | |
| DEVICE_NAME | Device name to mount; needed only for AWS | Mandatory | |
| ZONE_NAME | Zone name for GCP and region name for AWS | Mandatory | Note: Use REGION_NAME for AWS |
| TOTAL_CHAOS_DURATION | The time duration for chaos insertion (sec) | Optional | Defaults to 15s |
| APP_CHECK | If set to true, the experiment checks the status of the application | Optional | |
| RAMP_TIME | Period to wait before injection of chaos (sec) | Optional | |
| INSTANCE_ID | A user-defined string that holds metadata/info about the current run/instance of chaos. Ex: 04-05-2020-9-00. This string is appended as a suffix to the chaosresult CR name. | Optional | Ensure that the overall length of the chaosresult CR name is still < 64 characters |
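The table marks both PROJECT_ID and DEVICE_NAME as mandatory while noting that each applies to only one platform. One reading of this, sketched below as a pre-flight check, is that PROJECT_ID is needed for GKE and DEVICE_NAME for AWS. The function check_disk_loss_tunables is hypothetical and reads the tunables from shell variables of the same names; it is not part of the experiment itself.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: verifies that the mandatory tunables are set
# as shell variables. The platform-specific split is an interpretation of the
# table above, not behavior confirmed by the experiment code.
check_disk_loss_tunables() {
  local required="CLOUD_PLATFORM NODE_NAME DISK_NAME ZONE_NAME"
  local missing="" var value
  case "${CLOUD_PLATFORM:-}" in
    GKE) required="$required PROJECT_ID" ;;
    AWS) required="$required DEVICE_NAME" ;;
    *)   echo "CLOUD_PLATFORM must be GKE or AWS" >&2; return 1 ;;
  esac
  for var in $required; do
    # Indirect lookup of the variable named in $var (portable across shells).
    eval "value=\${$var:-}"
    [ -n "$value" ] || missing="$missing $var"
  done
  if [ -n "$missing" ]; then
    echo "missing tunables:$missing" >&2
    return 1
  fi
}
```

Running the check before applying the ChaosEngine gives an early failure with the list of unset names instead of a failed experiment job.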
Sample ChaosEngine Manifest
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: nginx-chaos
      namespace: default
    spec:
      # It can be true/false
      annotationCheck: 'false'
      # It can be active/stop
      engineState: 'active'
      #ex. values: ns1:name=percona,ns2:run=nginx
      auxiliaryAppInfo: ''
      appinfo:
        appns: 'default'
        applabel: 'app=nginx'
        appkind: 'deployment'
      chaosServiceAccount: disk-loss-sa
      monitoring: false
      # It can be retain/delete
      jobCleanUpPolicy: 'delete'
      experiments:
        - name: disk-loss
          spec:
            components:
              env:
                # set chaos duration (in sec) as desired
                - name: TOTAL_CHAOS_DURATION
                  value: '60'
                # set cloud platform name
                - name: CLOUD_PLATFORM
                  value: 'GKE'
                # set app_check to check application state
                - name: APP_CHECK
                  value: 'true'
                # GCP project ID
                - name: PROJECT_ID
                  value: 'litmus-demo-123'
                # Node name of the cluster
                - name: NODE_NAME
                  value: 'demo-node-123'
                # Disk Name of the node, it must be an external disk.
                - name: DISK_NAME
                  value: 'demo-disk-123'
                # Enter the device name which you wanted to mount only for AWS.
                - name: DEVICE_NAME
                  value: '/dev/sdb'
                # Name of Zone in which node is present (GCP)
                # Use Region Name when running with AWS (ex: us-central1)
                - name: ZONE_NAME
                  value: 'us-central1-a'
Create the ChaosEngine Resource
- Apply the ChaosEngine manifest prepared in the previous step to trigger the chaos.
kubectl apply -f chaosengine.yml
- If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.
Watch Chaos progress
- Set up a watch on the application that uses the disk in the Kubernetes cluster.
watch -n 1 kubectl get pods
Check Chaos Experiment Result
- Check whether the application is resilient to the disk loss, once the experiment (job) is completed. The ChaosResult resource name is derived like this: <ChaosEngine-Name>-<ChaosExperiment-Name>.
kubectl describe chaosresult nginx-chaos-disk-loss -n <CHAOS_NAMESPACE>
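The naming rule above, together with the INSTANCE_ID length constraint from the tunables table, can be sketched as a small shell helper. The function chaosresult_name is hypothetical, and joining the INSTANCE_ID with a hyphen is an assumption; the text only says it is appended as a suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the expected ChaosResult name and rejects
# names that are not shorter than 64 characters.
chaosresult_name() {
  local engine="$1" experiment="$2" instance_id="${3:-}"
  local name="${engine}-${experiment}"
  # Assumption: the optional INSTANCE_ID suffix is separated by a hyphen.
  if [ -n "$instance_id" ]; then
    name="${name}-${instance_id}"
  fi
  if [ "${#name}" -ge 64 ]; then
    echo "chaosresult name too long (${#name} chars): ${name}" >&2
    return 1
  fi
  printf '%s\n' "$name"
}
```

For the sample manifest, chaosresult_name nginx-chaos disk-loss yields the name used in the describe command above.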
Disk Loss Experiment Demo [TODO]
- A sample recording of this experiment execution is provided here.