Weekly DevOps Roundup: Supply Chain Guardrails and Agent Ops
This week's DevOps roundup centers on supply chain defense, with new npm compromises (including Shai-Hulud variants) reinforcing the need for safer publishing and install defaults, plus fast secret rotation and endpoint hunting when incidents land. We also saw practical hardening lessons from GitHub Actions and extension supply chain incidents, alongside GitHub platform changes that improve auditability (issue fields, OIDC expansion, and API behavior updates). On the operations side, Copilot and VS Code agent workflows moved closer to day-to-day incident response, while Azure updates covered GitOps in AKS, more control over autoscaling, and patching at scale with Arc. The thread running through it all is treating automation and agents as production attack surface, then backing that up with instrumentation, governance, and repeatable controls.
This Week's Overview
- Supply chain security: npm attacks and tighter publishing/install controls
- CI/CD exposure: GitHub Actions and extension supply chain incidents
- GitHub platform updates for workflow automation and security posture
- Copilot and VS Code agents move closer to day-to-day operations
- Azure cloud operations: GitOps, autoscaling controls, and patching at scale
- Observability and reliability for agents and AI workloads
- DevOps patterns for AI: governance, evals, and safe deployment architectures
- Other DevOps News
Supply chain security: npm attacks and tighter publishing/install controls
This week brought a blunt reminder that the JavaScript supply chain is still a high-value target, reinforcing last week's DevOps focus on guardrails (least privilege, auditable tooling, and safer defaults) as the only sustainable way to run modern automation. DevClass reported another wave of the “Shai-Hulud” campaign, in which a compromised npm account pushed malicious updates into 314 packages and the attacker then tried to reduce visibility by closing GitHub issues that flagged the behavior. The reported payload focused on credential theft and used GitHub for exfiltration and command-and-control. Remediation advice centered on rotating secrets, checking for unexpected repos, and auditing systems for persistence (including suspicious systemd services).
Microsoft Defender Security Research also detailed a related “Mini Shai Hulud” compromise affecting @antv packages that executed during install to steal CI/CD secrets, with a specific focus on GitHub Actions runners on Linux. Their write-up includes indicators of compromise (IOCs) and Microsoft Defender XDR hunting queries, which is useful if you need to scope exposure across developer endpoints and build agents.
On the preventative side, npm shipped two concrete hardening features. Staged publishing is now generally available, and npm CLI 11.15.0 adds install-time allowlist flags to control non-registry dependency sources, helping teams reduce risk from dependencies fetched via git URLs or other external sources. GitHub recommends pairing staged publishing with OIDC trusted publishing and maintainer approval backed by 2FA, which is a practical set of controls when multiple maintainers or automation publish on your behalf.
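As a complement to install-time controls, a lockfile audit can show which dependencies already come from outside the registry. The sketch below is not the npm CLI feature itself; it assumes a package-lock.json in lockfile version 2 or 3 (with a top-level "packages" map) and treats only the public npm registry as trusted. The function name and the sample lockfile are hypothetical.

```python
import json

# Assumption: only the public npm registry is trusted; add private mirrors as needed.
TRUSTED_PREFIXES = ("https://registry.npmjs.org/",)


def non_registry_dependencies(lockfile: dict) -> list[tuple[str, str]]:
    """Return (package path, resolved source) pairs fetched from outside the trusted registry."""
    findings = []
    for path, meta in lockfile.get("packages", {}).items():
        resolved = meta.get("resolved")
        # Skip the root project entry (no "resolved" field) and local workspace links.
        if not resolved or meta.get("link"):
            continue
        if not resolved.startswith(TRUSTED_PREFIXES):
            findings.append((path, resolved))
    return sorted(findings)


# Hypothetical lockfile content, inlined so the sketch runs without a real project.
SAMPLE_LOCKFILE = json.loads("""
{
  "lockfileVersion": 3,
  "packages": {
    "": {"name": "demo-app", "version": "1.0.0"},
    "node_modules/left-pad": {
      "version": "1.3.0",
      "resolved": "https://registry.npmjs.org/left-pad/-/left-pad-1.3.0.tgz"
    },
    "node_modules/internal-tool": {
      "version": "0.1.0",
      "resolved": "git+ssh://git@github.com/example-org/internal-tool.git#0123456789abcdef0123456789abcdef01234567"
    }
  }
}
""")

for path, source in non_registry_dependencies(SAMPLE_LOCKFILE):
    print(f"{path}: {source}")
```

Running a check like this in CI gives a concrete list of git and URL dependencies to review before deciding what an allowlist should permit.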
- Shai-Hulud keeps burrowing: 314 npm packages infected after another account compromise
- Mini Shai Hulud: Compromised @antv npm packages enable CI/CD credential theft
- Staged publishing and new install-time controls for npm
CI/CD exposure: GitHub Actions and extension supply chain incidents
Two separate incident threads landed in the same week, and both point to the same operational takeaway: treat developer tooling (extensions, CI triggers, caches) as production attack surface. That is a practical extension of last week's emphasis on reviewing and constraining automation actors (including agents) as if they were production systems. GitHub disclosed that a poisoned VS Code extension led to exfiltration of internal repositories, raising the likelihood of credential exposure and follow-on compromise. If your org depends heavily on VS Code extensions, this is another reason to lock down extension sources, review publisher trust, and enforce secret scanning and rapid secret rotation playbooks.
TanStack's incident follow-up focused on CI mechanics rather than extensions, describing how a malicious pull request abused GitHub Actions pull_request_target and shared caches. Their mitigations map to actionable hardening steps most teams can copy: remove pull_request_target where possible, disable or scope caches to avoid poisoning, and pin GitHub Actions to commit SHAs (instead of mutable tags). TanStack is even considering invitation-only pull requests to reduce the attack surface from unsolicited contributions, which will resonate with maintainers of popular repos.
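The SHA-pinning advice can be checked mechanically. The following is a minimal sketch, assuming workflow files are scanned as plain text with a regular expression rather than a YAML parser; the function name and the sample workflow (including the example-org actions) are hypothetical.

```python
import re

# Matches "uses: owner/repo[/path]@ref" entries, with or without a leading list dash.
USES_PATTERN = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)['\"]?", re.MULTILINE)
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a full 40-character commit SHA."""
    unpinned = []
    for reference in USES_PATTERN.findall(workflow_text):
        # Local actions (./path) have no ref; container images are out of scope here.
        if reference.startswith(("./", "docker://")):
            continue
        _, _, ref = reference.partition("@")
        if not FULL_SHA.match(ref):
            unpinned.append(reference)
    return unpinned


# Hypothetical workflow used only to exercise the check.
SAMPLE_WORKFLOW = """
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: example-org/example-action@0123456789abcdef0123456789abcdef01234567 # v1.2.3
      - uses: ./.github/actions/local-step
      - name: Third-party step
        uses: example-org/other-action@main
"""

print(unpinned_actions(SAMPLE_WORKFLOW))
```

A check like this only reports mutable tags and branches. It does not verify that a pinned SHA belongs to the expected upstream repository, so it is a first filter rather than a full review.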
- GitHub says internal repos exfiltrated after poisoned VS Code extension attack
- TanStack weighs invitation-only pull requests after supply chain attack
GitHub platform updates for workflow automation and security posture
Several GitHub updates this week targeted the boring-but-important work of making automation safer and more auditable across large organizations, following directly from last week's theme that small platform and policy shifts (rulesets, scanning surfaces, API changes) can either strengthen or quietly break operational workflows. Issue fields moved into public preview for all orgs, adding typed, org-wide metadata that you can search, show in project views, and automate via REST/GraphQL APIs and webhooks. For DevOps teams, that unlocks more consistent triage and routing (for example, required fields like service name, environment, severity, or owning team that Actions can validate before a workflow runs).
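The kind of pre-run validation described here might look like the sketch below. It assumes the issue's field values have already been fetched and flattened into a plain dictionary; the field names, allowed values, and function name are hypothetical, and nothing here reflects the actual shape of the issue fields API.

```python
# Hypothetical field names and allowed values; each organization defines its own.
REQUIRED_FIELDS = {
    "service": None,  # None means any non-empty value is accepted
    "environment": {"dev", "staging", "production"},
    "severity": {"sev1", "sev2", "sev3"},
    "owning_team": None,
}


def validate_issue_fields(fields: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the issue can be routed."""
    problems = []
    for name, allowed in REQUIRED_FIELDS.items():
        value = (fields.get(name) or "").strip()
        if not value:
            problems.append(f"missing required field: {name}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid {name}: {value!r}")
    return problems


# Example: an incomplete issue that a workflow should refuse to act on.
for problem in validate_issue_fields({"service": "checkout-api", "environment": "prod"}):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a workflow post every missing or invalid field back to the issue in a single comment.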
Security and compliance controls got a few notable tweaks. Dependabot will deprecate Python 3.9 support on June 23, 2026, which can stop update PRs for repos still pinned to 3.9, so now is the time to move to a supported runtime in your automation. GitHub also expanded OIDC authentication support for Dependabot and code scanning to more org-level private registries (Cloudsmith and Google Artifact Registry), with GitHub Enterprise Server 3.22 named as the release planned to bring this to on-premises installations.
There were also two small but operationally relevant changes for integrators. GitHub removed the code_scanning_upload field from the /rate_limit REST API response, so any tooling that parses that field needs to read the core rate limit pool instead (this is the deprecation we flagged last week, now landing as an actual behavior change). And enterprise admins can now start a GitHub Advanced Security trial directly from Secret Protection or Code Security risk assessments, which streamlines evaluation for orgs that want to trial features based on measured exposure.
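For tooling that used to read the removed field, the change can be as small as pointing at a different key. This is a minimal sketch that works on an already-parsed /rate_limit response; the function name is hypothetical and the numbers in the sample payload are made up for illustration.

```python
def code_scanning_upload_budget(rate_limit_payload: dict) -> dict:
    """Return the limit, remaining calls, and reset time to use for code scanning uploads."""
    resources = rate_limit_payload.get("resources", {})
    # The dedicated code_scanning_upload entry is gone, so read the core pool instead.
    pool = resources.get("core")
    if pool is None:
        raise KeyError("rate limit payload has no 'core' resource")
    return {"limit": pool["limit"], "remaining": pool["remaining"], "reset": pool["reset"]}


# Hypothetical, trimmed response body with illustrative values.
SAMPLE_PAYLOAD = {
    "resources": {
        "core": {"limit": 5000, "remaining": 4200, "reset": 1767225600, "used": 800},
        "search": {"limit": 30, "remaining": 30, "reset": 1767222060, "used": 0},
    }
}

budget = code_scanning_upload_budget(SAMPLE_PAYLOAD)
print(f"{budget['remaining']} of {budget['limit']} requests remaining")
```

Failing loudly when the core entry is missing is a deliberate choice here, since silently returning zero could make a pipeline skip uploads without anyone noticing.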
- Issue fields are now in public preview for all organizations
- Upcoming deprecation of Python 3.9 for Dependabot
- Expanded OIDC support for Dependabot and code scanning
- Removal of code_scanning_upload field from rate_limit API endpoint
- Start a GitHub Advanced Security trial from a risk assessment
Copilot and VS Code agents move closer to day-to-day operations
A consistent theme this week was operationalizing agent workflows: not just “AI helps write code”, but “AI participates in debugging, CI remediation, and governed automation”, building on last week's push to treat agents as first-class automation actors with reviewable outputs and auditable configuration. GitHub shipped general availability for remotely controlling Copilot CLI sessions from github.com and GitHub Mobile, and said similar remote control is coming to VS Code and JetBrains. That matters when Copilot is running long-lived tasks (scaffolding, refactors, PR prep) and you need to steer or approve outcomes away from your laptop, including pushing a PR workflow forward from web or mobile.
In GitHub Actions, Copilot cloud agent gained a “Fix with Copilot” button for failing jobs (Copilot Business and Copilot Enterprise). The workflow is explicitly operational: the agent investigates the failure, pushes a branch with a proposed fix, and requests review, which keeps a human in the loop while still reducing time-to-first-patch. GitHub also added a public preview REST API to audit Copilot cloud agent configuration at the repo level, including MCP server settings, enabled tools, Actions workflow policy, and firewall configuration (useful for governance and internal reviews).
On the editor side, Visual Studio Code 1.122 and the 1.121 “what's new” coverage both leaned into agent workflows: remote task triggering, source control refresh improvements, model configuration, and better language model tooling via a model picker and provider actions. There was also a new Agent Customization window highlighted separately, plus practical demos on tracing agent sessions with OpenTelemetry and viewing traces in the .NET Aspire Dashboard (and other examples that tie agent work to Grafana/OpenTelemetry pipelines).
- Take your local GitHub sessions anywhere
- One-click fixes for failing Actions with Copilot cloud agent
- Audit repository Copilot cloud agent configuration via the REST API
- Visual Studio Code 1.122
- Visual Studio Code and GitHub Copilot - What's new in 1.121
- The New Agent Customization Window in VS Code
- Tracing Agent Sessions with OpenTelemetry & Aspire
Azure cloud operations: GitOps, autoscaling controls, and patching at scale
Azure updates this week concentrated on making platform operations more “self-serve” while still keeping security boundaries clear, extending last week's Azure ops thread about scalable governance (landing zones, least-privilege identities, and structured tool access) into day-to-day cluster and fleet operations. AKS now has a public preview Argo CD extension integrated into the Azure Portal experience, aimed at guided GitOps setup and management rather than stitching together multiple entry points. The preview emphasizes Microsoft Entra ID SSO and workload identity federation (for example to Azure Container Registry and Azure DevOps), plus security hardening options, which is important if you're standardizing GitOps across multiple clusters and teams.
For serverless workloads, Azure Functions on Azure Container Apps added a high-impact preview feature: custom KEDA (Kubernetes Event-driven Autoscaling) scale rule overrides. Instead of being locked into platform-generated trigger rules, teams can replace them with their own KEDA scaler configuration and thresholds via the Container Apps REST API (using the allowScalingRuleOverride capability). That is a practical lever for tuning latency/cost tradeoffs and handling workload quirks (bursty queues, noisy topics, or custom metrics) without waiting for the platform's defaults to catch up.
Finally, Azure Arc expanded operational patching options: hotpatch enabled by Azure Arc is now available at no additional cost for Windows Server 2025. The post ties together eligibility, onboarding via the Azure Connected Machine agent, and orchestrating rollout through Azure Update Manager and APIs, which can simplify patch windows for hybrid fleets when downtime is hard to schedule.
- Announcing Public Preview of Argo CD extension in AKS Azure Portal Experience
- Custom KEDA Scale Rules for Azure Functions on Azure Container Apps
- Simplified access to Hotpatching enabled by Azure Arc for Windows Server 2025
Observability and reliability for agents and AI workloads
The observability story this week split into two layers: troubleshooting production systems with better interfaces, and building reliability practices specifically for agentic workloads, continuing last week's shift toward treating token spend, tool access, and agent behavior as operational concerns that need instrumentation and controls. Azure introduced a new chat experience for the Azure Copilot Observability agent in the Azure Portal, letting operators ask natural-language questions that get translated into queries across relevant telemetry sources. The point is not to replace KQL (Kusto Query Language), but to reduce the friction of correlating logs and metrics when you are mid-incident and not sure which data source to query first.
Multiple posts focused on how to measure and control AI usage and agent behavior once it is in production. One guide laid out six options for tracking Azure AI Foundry/Azure OpenAI usage, ranging from portal metrics to Managed Grafana dashboards and KQL reporting in Application Insights/Log Analytics, including Azure API Management (APIM) patterns for per-caller token tracking. Another guide applied SRE concepts to autonomous agents (Safety SLIs, autonomy/error budgets, behavioral circuit breakers, chaos experiments, replay debugging, and progressive capability rollouts), aligning agent operations with the same reliability discipline used for services.
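To make the error budget and behavioral circuit breaker ideas concrete, here is a minimal sketch. It is not taken from the guide: the class name, the sliding-window design, and the default thresholds are all assumptions, and a real system would derive the safe/unsafe signal from its own policy checks.

```python
from collections import deque


class BehaviorCircuitBreaker:
    """Pause an agent's autonomy when unsafe actions exceed an error budget."""

    def __init__(self, window: int = 20, max_unsafe_ratio: float = 0.1, min_samples: int = 5):
        self.outcomes = deque(maxlen=window)  # True = safe action, False = unsafe action
        self.max_unsafe_ratio = max_unsafe_ratio  # the error budget for the window
        self.min_samples = min_samples  # avoid tripping on the very first action
        self.open = False  # open circuit = agent must hand off to a human

    def record(self, safe: bool) -> None:
        """Record one action outcome and trip the breaker if the budget is exceeded."""
        self.outcomes.append(safe)
        if len(self.outcomes) < self.min_samples:
            return
        unsafe_ratio = self.outcomes.count(False) / len(self.outcomes)
        if unsafe_ratio > self.max_unsafe_ratio:
            self.open = True

    def allow_autonomous_action(self) -> bool:
        return not self.open

    def reset(self) -> None:
        """Intended to be called by a human reviewer after investigating a trip."""
        self.outcomes.clear()
        self.open = False


breaker = BehaviorCircuitBreaker(window=10, max_unsafe_ratio=0.2)
for safe in [True, True, True, True, False, False]:
    breaker.record(safe)
print("autonomy allowed:", breaker.allow_autonomous_action())
```

The breaker stays open until someone resets it, which mirrors the idea that an exhausted autonomy budget should force a human review rather than recover on its own.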
- The Azure Copilot Observability Agent Chat - Stop Writing Queries, Start Asking Questions
- How to Visualize Your Azure AI Workloads Usage for Observability
- Applying Site Reliability Engineering to Autonomous AI Agents
DevOps patterns for AI: governance, evals, and safe deployment architectures
A cluster of posts this week treated AI as something you operate, not just something you call, which is the same arc we covered last week with MCP-based tool access, least-privilege identity for agents, and governance patterns for multi-region “agent sprawl.” Microsoft open-sourced RAMPART and Clarity to bring safety and reviewability into agent development workflows: RAMPART turns red-teaming scenarios into pytest-based CI tests, while Clarity captures design intent and failure analysis as versioned repo artifacts (via a .clarity-protocol). This is the kind of tooling that fits naturally into existing pipelines, where “safety gates” can be tested alongside unit/integration tests.
On the platform architecture side, several Azure reference designs focused on making AI deployments controllable and auditable. One pattern places an agent and an MCP (Model Context Protocol) server on Azure App Service while routing Azure OpenAI traffic through Azure API Management for policy-based auth, semantic caching, token throttling, and Application Insights metrics suitable for chargeback. Another App Service-focused post showed how MCP servers can scale behind the App Service load balancer using MCP's stateless HTTP transport, with distribution verification via Application Insights and load testing with k6.
For teams trying to keep quality and cost predictable across multiple models, a separate guide walked through running evals for the Azure AI Foundry model router, measuring quality, cost, and latency while inspecting model-selection distribution. That “router eval” approach is a practical addition to CI, especially if your routing policy changes frequently or you need to justify a cost/latency tradeoff with data.
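The aggregation step of such a router eval could be sketched as follows. This assumes each eval run has already been reduced to a record with a chosen model, a quality score, a cost, and a latency; the record keys, model names, and values are hypothetical and do not come from the guide.

```python
from collections import Counter
from statistics import mean


def summarize_router_eval(records: list[dict]) -> dict:
    """Aggregate quality, cost, latency, and model-selection share from eval records."""
    selection = Counter(record["model"] for record in records)
    total = len(records)
    return {
        "mean_quality": mean(record["quality"] for record in records),
        "total_cost_usd": sum(record["cost_usd"] for record in records),
        "mean_latency_ms": mean(record["latency_ms"] for record in records),
        # Fraction of prompts the router sent to each model.
        "selection_share": {model: count / total for model, count in selection.items()},
    }


# Hypothetical eval records with made-up model names and scores.
SAMPLE_RECORDS = [
    {"model": "model-small", "quality": 0.5, "cost_usd": 0.25, "latency_ms": 200},
    {"model": "model-small", "quality": 1.0, "cost_usd": 0.25, "latency_ms": 400},
    {"model": "model-large", "quality": 1.0, "cost_usd": 1.0, "latency_ms": 900},
    {"model": "model-small", "quality": 0.5, "cost_usd": 0.5, "latency_ms": 500},
]

summary = summarize_router_eval(SAMPLE_RECORDS)
for key, value in summary.items():
    print(key, value)
```

Storing this summary per commit makes it possible to fail a CI run when a routing policy change moves quality, cost, or the selection share beyond an agreed threshold.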
- Introducing RAMPART and Clarity: Open source tools to bring safety into Agent development workflow
- You Can Build a Framework-Agnostic AI Gateway on Azure App Service — Here's How
- You Can Scale MCP Servers Behind a Load Balancer on App Service — Here's How
- How to run evals for the model router
Other DevOps News
Security posture improvements showed up in everyday automation guidance, including a detailed walk-through of replacing Azure Service Principal secrets with OIDC federation from GitHub Actions for Terraform deployments (with Entra ID setup and troubleshooting). That pairs naturally with last week's push toward least-privilege, short-lived credentials for both agents and pipelines. On the PowerShell side, PSResourceGet best practices reinforced treating the PowerShell Gallery as untrusted for runtime installs and recommended splitting discovery vs production repositories, alongside PSResourceGet 1.3 plans (MAR defaults, OCI/ORAS improvements, DSC v3 support, parallel installs).
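Most troubleshooting with this kind of federation comes down to the subject claim not matching exactly. As a small illustration, the sketch below builds the federated credential definition that Entra ID expects for a GitHub Actions workflow. The issuer, audience, and subject formats are the standard ones for GitHub's OIDC provider; the function name and the example organization, repository, and credential names are hypothetical.

```python
from typing import Optional

GITHUB_OIDC_ISSUER = "https://token.actions.githubusercontent.com"
ENTRA_TOKEN_EXCHANGE_AUDIENCE = "api://AzureADTokenExchange"


def federated_credential(name: str, org: str, repo: str, *,
                         branch: Optional[str] = None,
                         environment: Optional[str] = None) -> dict:
    """Build a federated credential definition scoped to one branch or one environment."""
    if (branch is None) == (environment is None):
        raise ValueError("specify exactly one of branch or environment")
    if environment is not None:
        subject = f"repo:{org}/{repo}:environment:{environment}"
    else:
        subject = f"repo:{org}/{repo}:ref:refs/heads/{branch}"
    return {
        "name": name,
        "issuer": GITHUB_OIDC_ISSUER,
        "subject": subject,  # must match the token's "sub" claim exactly
        "audiences": [ENTRA_TOKEN_EXCHANGE_AUDIENCE],
    }


# Hypothetical repository; only workflows running on main could exchange a token.
print(federated_credential("terraform-main", "example-org", "infra", branch="main"))
```

Scoping the subject to a single branch or environment is what keeps the credential narrow, since a token from any other ref or a pull request will not match.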
Several pragmatic ops and infra guides focused on repeatable maintenance and drift control: a process for refreshing golden images for Azure VMs and VM Scale Sets using Packer variables, pipelines, and Terraform (including VMSS upgrade mode considerations), plus a Terraform drift validator that compares design docs, Terraform state, and live Azure config using Azure Resource Graph. There were also useful “do this in bulk” scripts and configs, like a PowerShell tool to bulk-configure Azure Monitor diagnostic settings for Consumption Logic Apps and a CI/CD-safe fix for App Service Easy Auth on Logic Apps Standard using authsettingsV2 excluded paths for /runtime/*.
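The three-way comparison behind a drift validator can be sketched in a few lines. This is not the tool from the post: it assumes the design doc, Terraform state, and live configuration have each been flattened into a dictionary of property names to values, and the function name and sample properties are hypothetical.

```python
def detect_drift(designed: dict, state: dict, live: dict) -> list[str]:
    """Report every property whose value differs between design, state, and live config."""
    findings = []
    for key in sorted(set(designed) | set(state) | set(live)):
        values = {"design": designed.get(key), "state": state.get(key), "live": live.get(key)}
        # Compare by repr so missing (None) and mismatched values both count as drift.
        if len({repr(value) for value in values.values()}) > 1:
            detail = ", ".join(f"{source}={value!r}" for source, value in values.items())
            findings.append(f"{key}: {detail}")
    return findings


# Hypothetical storage account properties; the live TLS setting has drifted.
DESIGN = {"sku": "Standard_LRS", "min_tls_version": "1.2"}
STATE = {"sku": "Standard_LRS", "min_tls_version": "1.2"}
LIVE = {"sku": "Standard_LRS", "min_tls_version": "1.0"}

for finding in detect_drift(DESIGN, STATE, LIVE):
    print(finding)
```

Keeping all three sources in each finding shows whether the drift is between design and code or between code and the deployed resource, which usually determines who needs to act.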
Finally, a few workflow-focused items rounded out the week: semantic issue search in Copilot Chat (GA across Copilot plans) for faster incident/bug discovery from natural-language prompts, and GitHub's open source accessibility work (including a Copilot-powered accessibility scanner action) that can be folded into CI. If you're newer to OSS contributions, GitHub also shared a simple “good first issue” workflow, which is more process than platform change but still useful for onboarding.
- OIDC vs SPN: Securing Azure Deployments with GitHub Actions & Terraform
- PowerShell PSResource Roadmap and Best Practices
- Golden Image Refresh for Virtual Machines and VM Scale Sets: Driving Consistency at Scale
- Building a Terraform Drift Validator for Azure with Live Portal Verification
- Bulk-configure diagnostic settings on Azure Logic Apps Consumptions
- Easy Auth Configuration for Logic App Standard through CI/CD
- Semantic issue search in Copilot Chat
- Building GitHub’s next chapter in accessibility
- How to find your first open source project