The Tale of Unexpected Tech Hiccups: How a Simple Command Can Almost Break Your System

Rehmat Sayany · Westwing Tech Blog · Jan 16, 2024 · 4 min read


App throughput abruptly stopped at 4:30 pm

Introduction:

At Westwing, where innovation and excellence converge, we often find ourselves navigating the intricate landscape of software development. In this chronicle, we unfold the narrative of an unforeseen tech hiccup, the ‘Console.log Conundrum’: an incident that played out like a puzzle, revealing the unexpected challenges lurking within seemingly innocuous lines of code.

The Initial Puzzle:

It all started with a red alert: CPU usage for our mobile app’s k8s pods had exceeded acceptable levels. Yet our go-to monitoring tools, Grafana and New Relic, presented a perplexing picture: traffic was dwindling and CPU usage was dropping. Something deeper was at play, and we were determined to get to the bottom of it.

The Decisive Moment:

As we delved into the first mystery, another alert shook our focus: k8s_bff_Apps_pod_cpu_utilization was spiking. Quick thinking led us to manually increase the k8s replica count for the mobile application backend, bumping the number of microservice pods from 4 to 6. Even after the scale-up, something remained suspicious: CPU usage was not dropping. Little did we know that this decision would set the stage for an unforeseen storm.

The Unseen Threat:

Pod readiness checks revealed an alarming reality: pods were starting, their logs were growing, and then they were evicted, with probably hours passing between start and eviction. Numerous app pods were evicted one after the other, leaving only one survivor. Panic set in as the last stable Apps backend microservice pod succumbed at exactly 17:18. Adding to the complexity, the k8s cluster events showed Calico pull errors for the docker.io/calico/pod2daemon-flexvol:v3.25.1 image. The evicted pods were gone, and new pods could not be spawned because of the missing Calico image. Calico primarily provides networking and network security for containerized workloads; it does not directly handle the replication or management of pods in Kubernetes. It does, however, play a crucial role in ensuring proper communication and connectivity between pods, which indirectly contributes to the reliability and availability of replicated pods.

The Race Against Downtime:

With the downtime clock ticking, we had to act swiftly. We downsized other parts of our system and managed to bring the Apps backend back online. The entire ordeal, though intense, concluded in less than 30 minutes.

The Unexpected Culprit:

Indeed, our investigation unveiled multiple contributing factors to the incident. One significant issue was the set of errors related to the Calico image, which prevented replacement pods from being created after the evictions. This complication hindered the proper functioning of our application, emphasizing the critical role of a stable networking infrastructure in the overall reliability of our system.

The other, and most interesting, culprit was a seemingly harmless console.log. A day earlier, during a live issue, we had deployed a hotfix to debug the problem successfully. In the rush to address the immediate concern, however, we inadvertently left the debug logging in place, leading to an accumulation of log data and, eventually, pod storage issues. You might be wondering why we deployed a hotfix in the first place: the urgency of the situation demanded a swift resolution to a critical issue affecting our live environment, and in such cases a hotfix is often the fastest way to address the problem and restore normalcy. Unfortunately, amidst the urgency, we neglected to remove the hotfix deployment, causing unintended consequences.
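For illustration, here is a minimal TypeScript sketch, not our actual hotfix, of how that debug output could have been gated behind an environment flag so that a forgotten debug statement stays silent in production. The flag name DEBUG_CHECKOUT and the helper debugLog are hypothetical:

```typescript
// Minimal sketch (hypothetical names): debug output gated behind an env flag,
// so leftover debug statements produce nothing unless explicitly enabled.
const DEBUG_ENABLED = process.env.DEBUG_CHECKOUT === 'true';

function debugLog(message: string, payload?: unknown): void {
  if (!DEBUG_ENABLED) return; // no-op in production unless the flag is set
  console.log(`[debug] ${message}`, payload !== undefined ? JSON.stringify(payload) : '');
}

// In a hot request path, an unguarded console.log emits one line per request,
// steadily filling the pod's ephemeral storage via the container logs.
export function handleRequest(requestBody: unknown): void {
  debugLog('incoming request', requestBody); // silent unless DEBUG_CHECKOUT=true
  // ...actual request handling...
}
```

Even a one-line guard like this keeps a busy request path from logging on every call, which is exactly the kind of steady growth that eventually exhausted our pods’ storage.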

Lessons Learned:

As we reflected on this incident, several valuable lessons came to light:

  1. Mindful Hotfix Cleanup: While hotfixes are essential for quick issue resolution, it’s equally crucial to prioritize cleanup. Ensure that temporary code and debug logs are promptly removed after their purpose is served.
  2. Proactive Issue Management: Keep a close eye on your production environment, and promptly address critical issues. However, always maintain a balance between urgency and diligence to prevent oversights.
  3. Thorough Cluster Monitoring: Regularly monitor your Kubernetes cluster for events that could impact pod stability. In our case, the Calico pull errors for the docker.io/calico/pod2daemon-flexvol:v3.25.1 image added an extra layer of complexity, emphasizing the importance of staying vigilant about your cluster’s health.
  4. Communication and Coordination: In high-pressure situations, effective communication and coordination among team members are paramount. Everyone should be on the same page regarding the status and necessary follow-up actions.
  5. Importance of Alerts: Incorporating Pingdom alerts, especially for pod counts falling below the expected threshold, proves invaluable. Diversifying alert configurations further enhances system stability and availability, providing proactive insights so potential issues can be addressed promptly.
  6. Limit log size: Implementing size limits and a rotation mechanism to address log accumulation beyond predefined thresholds is vital. This practice not only ensures efficient resource utilization but also contributes to the overall health and resilience of the system (a minimal sketch follows this list).
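To make the last lesson concrete, here is a minimal TypeScript sketch of size-capped log rotation using only Node’s built-in fs module. The file path, size limit, and file count are illustrative assumptions rather than our production configuration; for services that log to stdout in Kubernetes, similar caps can also be enforced at the node level, for example via the kubelet’s containerLogMaxSize and containerLogMaxFiles settings.

```typescript
// Minimal sketch of size-capped, rotating file logging with Node's fs module.
// Path, size limit, and file count are illustrative assumptions.
import * as fs from 'fs';

const LOG_FILE = '/var/log/app/app.log'; // hypothetical path
const MAX_BYTES = 10 * 1024 * 1024;      // rotate once the file reaches ~10 MB
const MAX_FILES = 3;                     // keep at most 3 rotated files

function rotateIfNeeded(): void {
  let size = 0;
  try {
    size = fs.statSync(LOG_FILE).size;
  } catch {
    return; // nothing written yet, nothing to rotate
  }
  if (size < MAX_BYTES) return;

  // Shift app.log -> app.log.1 -> app.log.2 ..., overwriting the oldest copy.
  for (let i = MAX_FILES; i >= 1; i--) {
    const src = i === 1 ? LOG_FILE : `${LOG_FILE}.${i - 1}`;
    const dst = `${LOG_FILE}.${i}`;
    if (fs.existsSync(src)) fs.renameSync(src, dst);
  }
}

export function logLine(message: string): void {
  rotateIfNeeded();
  fs.appendFileSync(LOG_FILE, `${new Date().toISOString()} ${message}\n`);
}
```

Established logging libraries offer equivalent, battle-tested options; the point is that every log sink should have a hard ceiling so that runaway logging degrades gracefully instead of evicting pods.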

Conclusion:

In the dynamic world of software development, our journey highlights the critical role of timely alerts, efficient log management, and proactive issue resolution. As we dissect the ‘Console.log Conundrum,’ the importance of balancing immediate concerns with a broader system perspective becomes evident. Let this experience serve as a reminder of the ever-present need for resilient practices in the face of unexpected challenges, propelling us toward a future of more robust and adaptive software development.


Full-stack developer @ Westwing, passionate about Node.js, TypeScript, React, and AWS.