Azure Sizing & GlobalProtect Coexistence — Round 2 Research Report
Session: 20260321-0115
Domain: Azure VM Sizing for 100+ Peers and GlobalProtect Coexistence During Migration
Date: 2026-03-21
Round: 2 (resolving R1 conflicts)
Tools Used: mcp__claude_ai_Tavily__tavily_search (12 queries), WebFetch (7 pages), Bash (1 command)
Executive Summary
Round 1 produced conflicting Azure VM sizing recommendations (B1ms at $15/mo vs B2s at $30/mo). After deeper investigation, the answer is nuanced: B1ms is sufficient for the management plane with 100-150 peers under normal conditions, but B2s provides essential headroom for relay traffic spikes and policy evaluation at scale. The key variable is not peer count but policy complexity — GitHub issue #4488 demonstrates that 70 peers with 480 policies consumed 8 vCPU due to posture check inefficiency. For GSISG’s simpler policy set (likely <20 policies), B1ms is viable, but B2s at $30/mo eliminates risk for only $15/mo more.
SQLite is adequate for 100-150 peers with modest policy sets. PostgreSQL becomes necessary at 300+ peers or when events.db bloat causes I/O contention. A separate relay server is unnecessary for this deployment — most connections will be direct P2P, and the embedded relay in the combined server handles the minority of relayed connections without issue.
Regarding GlobalProtect coexistence: the issue is real but well-understood. GP triggers NetBird’s network monitor to restart the WireGuard interface, killing active sessions. The workaround (--network-monitor=false) is safe during migration with manageable side effects. The fix (PR #5156) remains open and unmerged as of March 2026. Both VPNs CAN run simultaneously with the workaround — they do not fight over default routes or DNS by design, because NetBird uses split routing (100.64.0.0/10 overlay) while GP uses its own tunnel for corporate routes.
Question 1: Azure VM Sizing for 100-150 Peers
The Definitive Answer: B2s ($30/mo) Is the Right Choice, B1ms Is Viable Fallback
The confusion in R1 arose because both agents were partly correct but examined different aspects:
Management plane (control traffic only): B1ms (1 vCPU / 2 GB) is sufficient. The management server is a lightweight Go binary handling peer registration, policy sync, and signaling. Official docs confirm “1 CPU and 2 GB of memory” as the minimum. A Hacker News commenter reports running “1k active users setup, super efficient and stable.”
With relay traffic: This is where B1ms becomes risky. One community report noted 50% CPU with 25 peers when many connections were relayed. However, this was on older versions with Coturn TURN relay. The v0.62+ built-in relay uses QUIC, which is more CPU-efficient.
With complex policies: GitHub issue #4488 reveals the real scaling bottleneck. With 480 network policies and 70 peers, 8 vCPU was required because posture checks (OS version, NB version) used excessive regexp operations on every peer sync. This is a pathological case — GSISG will have far fewer policies.
Resource Consumption Estimates for GSISG (100-150 peers)
| Component | CPU Impact | RAM Impact | Notes |
|---|---|---|---|
| Management server (peer sync) | ~5-15% of 1 vCPU | ~200-400 MB | Scales with peer count * policy count |
| Signal server (connection setup) | <5% of 1 vCPU | ~50-100 MB | Only active during connection establishment |
| Embedded relay (relayed traffic) | 0-30% of 1 vCPU | ~50-150 MB | Depends on how many peers relay; see Q3 |
| Dashboard + Traefik | <5% of 1 vCPU | ~100-200 MB | Static serving, minimal impact |
| SQLite I/O | Minimal | N/A | See Q4 |
| Total estimated | ~15-50% of 1 vCPU | ~400-850 MB | Under normal operation |
| Peak (all peers reconnecting) | ~80-100% of 1 vCPU | ~1-1.5 GB | After server restart or network event |
Why B2s Wins
| Factor | B1ms (1 vCPU / 2 GB) | B2s (2 vCPU / 4 GB) |
|---|---|---|
| Normal operation | Sufficient | Comfortable |
| Mass reconnection event | CPU-constrained; peers queue | Handles gracefully |
| Relay under load | Risk of CPU starvation | Adequate headroom |
| Future growth (150-250 peers) | Requires migration | Handles without changes |
| Docker overhead (4 containers) | Tight on 2 GB RAM | Comfortable on 4 GB RAM |
| Monthly cost (pay-as-you-go) | ~$15.11 | ~$30.37 |
| Cost difference | Baseline | +$15.23/mo ($183/yr) |
Verdict: B2s at $30/mo buys meaningful risk reduction for $15/mo more. The B-series burstable model is ideal because NetBird’s workload is bursty (low baseline, spikes during peer registration).
Scaling Path If Needed
- Start with B2s (single server, embedded relay, SQLite)
- If relay CPU becomes a concern, extract relay to separate B1ms
- If database I/O becomes a concern (300+ peers), migrate to PostgreSQL
- B2s handles up to ~300 peers comfortably; beyond that, consider B4ms or separated components
Question 2: Combined Server Binary Resource Usage (v0.62+)
What Changed in v0.62+
Before v0.62, NetBird required 7+ containers: management, signal, relay (Coturn), dashboard, Traefik, external IdP (e.g., Zitadel), and optional database. This consumed 2-4 GB RAM minimum.
Since v0.62, the unified netbird-server container combines management, signal, relay, and embedded STUN into a single Go binary. Built-in local user management eliminates the mandatory external IdP. The deployment now requires only 3-4 containers total.
Real-World Resource Reports
| Source | Scale | Infrastructure | Observation |
|---|---|---|---|
| NetBird official docs | Minimum | Any | "1 CPU and 2 GB of memory" |
| Cloudron Forum user | ~20 peers | Hetzner CX11 (1 vCPU, 2 GB) | "Running smoothly for over a year" |
| Carl Pearson (Hetzner guide) | Small deployment | "Even the leanest VPS is enough" | No resource issues |
| HN commenter | 1,000 active users | Not specified | "super efficient and stable" |
| GitHub #4488 | 70 peers, 480 policies | 8 vCPU required | Pathological: posture check CPU burn |
| GitHub #1473 | 370 users, 141 groups | 16 CPU / 16 GB insufficient | SQLite locking + high policy count |
| dev.to self-hosting guide | Small/medium | VPS with 1-2 GB | Successful deployment |
Key Insight: Policy Complexity Matters More Than Peer Count
The 1k-user success and the 70-peer failure are not contradictory. The difference is policy complexity:
- Simple policies (e.g., “All Peers can access Routing Peer”): O(n) computation per peer sync
- Complex policies (480 policies with posture checks): O(n * p * g) where n=peers, p=policies, g=groups
For GSISG with likely 5-15 policies and 2-3 groups, the compute overhead per peer sync is trivial. The management server will idle at <10% CPU between authentication events.
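The scaling argument reduces to simple arithmetic, sketched below as a toy cost model. The operation counts are illustrative, not measured, and the group count for the #4488 case is an assumption (the issue reports only peers and policies); only the ratio between scenarios matters.

```python
# Toy cost model for policy evaluation per network sync.
# Constants are illustrative; only the ratio between scenarios matters.
def sync_cost(peers: int, policies: int, groups: int) -> int:
    """Approximate O(n * p * g) policy-evaluation operations per full sync."""
    return peers * policies * groups

# GSISG-like deployment: few policies, few groups
gsisg = sync_cost(peers=150, policies=15, groups=3)
# Issue #4488 scenario: 70 peers, 480 policies (group count is a guess)
pathological = sync_cost(peers=70, policies=480, groups=20)

print(f"GSISG: {gsisg:,} ops/sync")
print(f"#4488: {pathological:,} ops/sync (~{pathological // gsisg}x)")
```

Even with generous assumptions, the GSISG workload is roughly two orders of magnitude lighter per sync than the pathological case.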
Memory Breakdown (v0.62+ combined binary)
| Component | Estimated RAM | Basis |
|---|---|---|
| netbird-server (management + signal + relay) | 200-500 MB | Go binary, scales with peer count |
| dashboard (nginx) | 50-100 MB | Static files |
| Traefik | 50-100 MB | Reverse proxy |
| Docker overhead | 100-200 MB | Container runtime |
| Total | 400-900 MB | For 100-150 peers |
Conclusion: 2 GB RAM (B1ms) is tight but workable. 4 GB RAM (B2s) provides comfortable headroom for spikes and eliminates OOM risk.
Question 3: Embedded Relay Sufficiency
Short Answer: The Embedded Relay Is Sufficient for 100-150 Peers
A separate relay server is not needed for GSISG’s deployment. Here is the analysis:
Direct P2P vs. Relayed Connection Estimation
NetBird uses ICE (Interactive Connectivity Establishment) to maximize direct P2P connections. The relay is only used when both peers are behind symmetric NAT or restrictive firewalls.
GSISG’s network profile:
| User Category | Count | NAT Type | Expected Connection Type |
|---|---|---|---|
| Office workers (Honolulu, on corporate LAN) | ~60-70 | Behind pfSense, likely port-restricted NAT | P2P to other office; P2P or relay to remote |
| Office workers (Boulder, on corporate LAN) | ~20-30 | Behind pfSense, likely port-restricted NAT | P2P to other office; P2P or relay to remote |
| Remote workers (home) | ~20-30 | Home router (easy NAT / full cone) | P2P in most cases |
| Field workers (cellular) | ~5-10 | CGNAT (carrier-grade NAT) | Likely relayed |
| Azure routing peer | 1 | 1:1 NAT (cloud) | P2P to most peers |
Estimated relay percentage: Based on this profile, approximately 5-15% of active connections will require relay. This is consistent with Tailscale’s published data showing >90% direct connections in typical deployments.
Relay Bandwidth Estimation
Most peers are idle (maintaining control channel only). Active data transfer occurs for ~10-15 users at any time (SMB, RDP).
| Metric | Estimate | Reasoning |
|---|---|---|
| Total active connections needing relay | 1-3 simultaneously | 5-15% of 10-15 active users |
| Average SMB/RDP bandwidth per user | 2-5 Mbps | Typical file share browsing / RDP |
| Peak relay bandwidth | 5-15 Mbps | 3 users * 5 Mbps |
| Relay CPU per Mbps (QUIC) | ~1-2% of 1 vCPU | WireGuard encryption is lightweight |
| Total relay CPU at peak | ~5-30% of 1 vCPU | Well within B2s capacity |
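The peak figures in the table follow from the upper-bound assumptions; a quick sanity check:

```python
import math

# Upper-bound relay load estimate, using the assumptions from the tables above.
active_users = 15          # users moving data concurrently (SMB/RDP)
relay_fraction = 0.15      # worst case: 15% of connections relayed
mbps_per_user = 5          # worst case: per-user SMB/RDP bandwidth
cpu_pct_per_mbps = 2       # worst case: QUIC relay CPU cost per Mbps

relayed_conns = math.ceil(active_users * relay_fraction)  # simultaneous relayed connections
peak_mbps = relayed_conns * mbps_per_user                 # peak relay bandwidth
peak_cpu_pct = peak_mbps * cpu_pct_per_mbps               # % of one vCPU at peak

print(f"{relayed_conns} relayed conns, {peak_mbps} Mbps peak, ~{peak_cpu_pct}% of 1 vCPU")
```

Even the worst-case combination stays within a single vCPU of the B2s.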
Why the “50% CPU with 25 peers” Report Does Not Apply
The R1 report cited a user experiencing 50% CPU with 25 peers on relay. Context matters:
- That report was likely from an older version using Coturn (TCP TURN), which is heavier than the v0.62+ QUIC relay
- 25 peers ALL relaying is an unusual scenario — GSISG will have ~1-3 relayed at any time
- The report may have included management overhead, not just relay
When a Separate Relay IS Needed
A separate relay server makes sense when:
- >50 peers are simultaneously relaying (unlikely for GSISG)
- Geographic distribution requires relay servers in multiple regions (GSISG is US-only, single Azure region works)
- Relay bandwidth exceeds 100 Mbps sustained (GSISG’s estimate is <15 Mbps peak)
- HA is required for the relay independently of management (not needed at this scale)
Verdict: The embedded relay in the combined server is more than sufficient. Deploying a separate relay adds cost ($15-30/mo) and complexity with no benefit for this deployment size. If relay load becomes a concern later, it can be extracted as a separate container or VM without disruption.
Question 4: SQLite vs. PostgreSQL for 100+ Peers
Short Answer: SQLite Is Fine for 100-150 Peers; PostgreSQL Is Overkill
The R1 recommendation to use PostgreSQL was based on generic guidance. After examining actual NetBird-specific evidence:
Evidence Analysis
| Source | Peer Count | Database | Result |
|---|---|---|---|
| GitHub #1473 (original report) | 11 users, 10 peers | SQLite | events.db reached 1 GB; slow login performance |
| GitHub #1473 (follow-up) | 370 users, 141 groups | SQLite | Insufficient performance; requested PostgreSQL |
| GitHub #4488 | 70 peers, 480 policies | SQLite (implicit) | CPU issue, not database issue |
| Headscale benchmark (#2001) | 600 clients | SQLite vs PostgreSQL | PostgreSQL recommended for >500 clients |
| HN commenter | 1,000 users | Unknown | "super efficient and stable" |
| NetBird official docs | Any | SQLite default | "optional" PostgreSQL migration |
The Real Bottleneck: events.db Bloat
The primary SQLite issue is not peer count but events.db growth. NetBird logs every event (peer connect, disconnect, policy change, etc.) to a SQLite database. With 100+ peers connecting/disconnecting throughout the day, events.db can grow to multiple GB within months.
Impact: Large events.db causes:
- Slow dashboard loading (activity page queries entire events table)
- SQLite write locking during event insertion (single-writer limitation)
- Disk I/O spikes during periodic event purges
Mitigation without PostgreSQL:
- Periodically archive/truncate events.db (simple cron job)
- Use WAL mode for SQLite (NetBird enables this by default)
- Place SQLite on SSD (Azure Premium SSD is default)
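A minimal sketch of such a cleanup job. The `events` table and Unix-epoch `timestamp` column are assumptions for illustration; inspect the real schema first (e.g., `sqlite3 events.db .schema`) and adapt the query before using it.

```python
import sqlite3
import time

def prune_events(db_path: str, keep_days: int = 90) -> int:
    """Delete event rows older than keep_days, then reclaim disk space.

    ASSUMPTION: an `events` table with a Unix-epoch `timestamp` column.
    Verify against the actual events.db schema before deploying.
    """
    cutoff = int(time.time()) - keep_days * 86400
    conn = sqlite3.connect(db_path)
    try:
        deleted = conn.execute(
            "DELETE FROM events WHERE timestamp < ?", (cutoff,)
        ).rowcount
        conn.commit()
        conn.execute("VACUUM")  # shrink the file after the bulk delete
        return deleted
    finally:
        conn.close()
```

Run it from a monthly cron job during a maintenance window: SQLite is single-writer and VACUUM takes an exclusive lock, so avoid running it while the management server is under load.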
When PostgreSQL Becomes Necessary
| Threshold | Symptom | Action |
|---|---|---|
| <150 peers, <20 policies | No issues | Stay on SQLite |
| 150-300 peers, moderate policies | Occasional slow dashboard loads | Consider PostgreSQL or events cleanup |
| 300+ peers OR 100+ policies | Consistent login delays, SQLite locking | Migrate to PostgreSQL |
| 500+ peers | Management server CPU spikes on peer sync | PostgreSQL + consider component separation |
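The thresholds above can be read as a simple decision rule. This is a sketch; the boundaries are judgment calls drawn from the community reports, not hard limits.

```python
def db_recommendation(peers: int, policies: int) -> str:
    """Mirror of the threshold table; boundaries are judgment calls."""
    if peers >= 300 or policies >= 100:
        return "migrate to PostgreSQL"
    if peers >= 150:
        return "consider PostgreSQL or events cleanup"
    return "stay on SQLite"

print(db_recommendation(peers=120, policies=15))  # GSISG today
print(db_recommendation(peers=370, policies=50))  # GitHub #1473 scale
```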
PostgreSQL Cost (If Needed Later)
| Option | Monthly Cost | Notes |
|---|---|---|
| Co-located on same VM | $0 (already in B2s) | Simplest; some resource contention |
| Azure Database for PostgreSQL Flexible (Burstable B1ms) | ~$12-15/mo | Managed; automatic backups |
| Azure Database for PostgreSQL Flexible (GP D2s_v3) | ~$100/mo | Overkill for NetBird |
Verdict: Start with SQLite. Monitor events.db size monthly. Implement events cleanup script. Migrate to PostgreSQL only if symptoms appear, which is unlikely at 100-150 peers.
Question 5: Azure Cost Breakdown
Recommended Configuration: Single B2s VM
| Component | Specification | Monthly Cost | Annual Cost |
|---|---|---|---|
| VM: Standard_B2s | 2 vCPU, 4 GB RAM, Linux | $30.37 | $364.44 |
| OS Disk: P4 Premium SSD | 32 GB (sufficient for OS + Docker + NetBird) | $5.28 | $63.36 |
| Public IP: Standard Static IPv4 | Required for peer connectivity | $3.65 | $43.80 |
| Bandwidth (egress) | Estimated 5-10 GB/mo (signaling + relay) | $0.00-$0.44 | $0.00-$5.28 |
| Total (pay-as-you-go) | | $39.30-$39.74 | $471.60-$476.88 |
With Reserved Instance Pricing
| Commitment | VM Monthly | VM Annual | Total Monthly (all components) | Total Annual |
|---|---|---|---|---|
| Pay-as-you-go | $30.37 | $364.44 | ~$39.30 | ~$471.60 |
| 1-year reserved (~37% savings on VM) | ~$19.13 | ~$229.56 | ~$28.06 | ~$336.72 |
| 3-year reserved (~60% savings on VM) | ~$12.15 | ~$145.80 | ~$21.08 | ~$252.96 |
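These figures can be reproduced by applying the reservation discount to the VM only, leaving the disk and static-IP charges at pay-as-you-go rates. A sanity-check sketch using the numbers above:

```python
# Sanity check on the reserved-instance table: the discount applies to the
# VM only; the P4 disk ($5.28) and static IPv4 ($3.65) are not discounted.
payg_vm = 30.37                  # Standard_B2s, $/mo
fixed = 5.28 + 3.65              # disk + static IP, $/mo

for label, discount in [("pay-as-you-go", 0.00),
                        ("1-yr reserved", 0.37),
                        ("3-yr reserved", 0.60)]:
    vm = payg_vm * (1 - discount)
    print(f"{label}: VM ${vm:.2f}/mo, total ${vm + fixed:.2f}/mo")
```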
Bandwidth Deep Dive
NetBird management traffic is minimal. Here is the breakdown:
| Traffic Type | Direction | Monthly Volume | Cost |
|---|---|---|---|
| Peer signaling (100-150 peers) | Egress | ~100-500 MB | Free (within 100 GB free tier) |
| Relay traffic (5-15% of active users) | Egress | ~2-5 GB | Free (within 100 GB free tier) |
| STUN responses | Egress | ~10-50 MB | Free |
| Dashboard/API | Egress | ~50-200 MB | Free |
| Total egress | Egress | ~2-6 GB/mo | $0.00 (within free 100 GB) |
Key finding: Azure’s first 100 GB/mo of egress from North America is free. NetBird’s total egress for 100-150 peers is well under this threshold. Bandwidth is effectively free.
Comparison with R1 Estimates
| R1 Agent | Recommended Config | Monthly Cost |
|---|---|---|
| R1 Agent A (cost report) | B1ms + separate components | ~$29.76 |
| R1 Agent B (platform report) | B2s + separate relay + PostgreSQL | ~$70-85 |
| R2 (this report) | B2s single VM, SQLite, embedded relay | ~$39.30 |
The R2 recommendation splits the difference: uses the safer B2s VM but avoids the unnecessary cost of separate relay servers and managed PostgreSQL. The total is $39.30/mo pay-as-you-go or ~$28/mo with 1-year reserved pricing.
Question 6: GlobalProtect + NetBird Coexistence (GitHub #5077)
Section titled “Question 6: GlobalProtect + NetBird Coexistence (GitHub #5077)”What Exactly Happens
The problem is precisely documented in GitHub issue #5077 (opened January 9, 2026, status: OPEN).
When a user activates GlobalProtect VPN while NetBird is already running:
- GlobalProtect creates a new virtual network adapter (“PANGP Virtual Ethernet Adapter Secure”)
- GP adds a default route (or modifies the routing table) via this adapter
- NetBird’s network monitor detects the route change as a “significant network change”
- NetBird’s engine interprets this as a network switch (e.g., Wi-Fi to Ethernet) and initiates a full client restart
- The restart tears down the WireGuard interface (wt0), destroying all active TCP sessions
- NetBird recreates the interface and reconnects, but established SSH/RDP sessions are lost
Root cause: NetBird’s network monitor on Windows watches for default route changes as a signal that the network environment has changed. It does not distinguish between a physical network change (switching Wi-Fi networks) and a VPN adding a virtual adapter with its own routes.
The --network-monitor=false Flag
This flag disables NetBird’s network change detection entirely. When set:
What it disables:
- Detection of default route changes
- Automatic WireGuard interface restart when network changes occur
- Recovery logic that re-establishes connections after network switches
Side effects of disabling:
| Scenario | With network-monitor (default) | With --network-monitor=false |
|---|---|---|
| Switch Wi-Fi to Ethernet | Auto-reconnects within seconds | May take 25+ seconds (WireGuard keepalive timeout) |
| Switch between Wi-Fi networks | Auto-reconnects within seconds | May lose connection; manual netbird down && netbird up needed |
| VPN connects/disconnects | Interface restart (the bug) | No disruption (desired behavior) |
| Resume from sleep/hibernate | Quick reconnection | May require manual reconnection |
| IP address change (DHCP renewal) | Detected and handled | May not recover automatically |
Assessment for GSISG migration: The side effects are manageable. Most users are on stable office Ethernet or home Wi-Fi — they rarely switch networks during the workday. The 25-second WireGuard keepalive timeout provides passive reconnection for most scenarios. Field workers on cellular may need to manually toggle NetBird occasionally.
PR Status (as of March 2026)
| PR | Title | Status | Notes |
|---|---|---|---|
| #5155 | "Check Windows Interfaces by Name and Description" | CLOSED (not merged) | Code review found test failures and regression risk |
| #5156 | "Change Soft Interface detection to include Interface Description" | OPEN (not merged) | Actively addresses #5077; passed SonarQube quality gate; awaiting maintainer approval |
PR #5156 approach: Extends Windows network monitor to check both interface Name and Description against a list of known soft/virtual adapter keywords. “pangp” (Palo Alto Networks GlobalProtect) is added to this list, so route changes from GP’s virtual adapter are ignored.
No version includes the fix yet. The flag --network-monitor=false remains the only workaround. When PR #5156 eventually merges, it will likely appear in a v0.67.x or v0.68.x release.
Question 7: Routing and DNS Conflict Analysis
Section titled “Question 7: Routing and DNS Conflict Analysis”Can Both Tunnel Interfaces Run Simultaneously?
Yes, with important caveats.
Routing Table Coexistence
GlobalProtect and NetBird use fundamentally different routing approaches:
| Aspect | GlobalProtect | NetBird |
|---|---|---|
| Interface name | "PANGP Virtual Ethernet Adapter" | wt0 (WireGuard) |
| Address space | Corporate subnets (e.g., 10.x.x.x) | Overlay 100.64.0.0/10 |
| Routing mode | Split-tunnel or full-tunnel (configurable by admin) | Split routing (only NetBird overlay + configured routes) |
| Default route | May add 0.0.0.0/0 (full-tunnel mode) | Does NOT add default route (unless exit node is configured) |
If GlobalProtect is in split-tunnel mode: No routing conflict. GP routes corporate traffic (e.g., 10.100.7.0/24, 10.15.0.0/24) and NetBird routes overlay traffic (100.64.0.0/10). The routes are non-overlapping and coexist in the routing table.
If GlobalProtect is in full-tunnel mode: GP claims the default route (0.0.0.0/0). This means:
- All internet traffic goes through GP
- NetBird’s WireGuard traffic to its management server/relay goes through GP’s tunnel (if the management server IP is not excluded)
- This may cause MTU issues (double encapsulation: WireGuard inside GP’s IPsec/SSL tunnel)
- NetBird’s specific overlay routes (100.64.0.0/10) still work because they are more specific than 0.0.0.0/0
Recommendation: Ensure GlobalProtect is in split-tunnel mode during migration. If it must be full-tunnel, add NetBird management server IP to GP’s split-tunnel exclusion list to avoid double encapsulation.
DNS Coexistence
Section titled “DNS Coexistence”| Aspect | GlobalProtect | NetBird |
|---|---|---|
| DNS modification | Pushes corporate DNS servers to client | Configures match-domain DNS (split DNS) |
| DNS scope | May override all DNS or just internal domains | Only manages DNS for configured match domains |
| Conflict potential | Medium | Low (if using match domains) |
Potential conflict: If GP pushes DNS servers that override the system default, and NetBird also configures DNS for match domains, the order of DNS resolver configuration matters. On Windows, the “more specific” configuration typically wins for match domains, but GP’s DNS push can occasionally override NetBird’s settings.
Mitigation: During migration, configure NetBird with match-domain DNS (e.g., *.company.internal -> internal DNS) and leave primary DNS to GP or the system default. This avoids conflicts because NetBird only intercepts DNS for its configured domains.
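The match-domain behavior can be sketched as a suffix lookup: only hostnames under a configured match domain go to the internal resolver, and everything else falls through to the system (or GP-pushed) DNS. The domain names and server addresses below are illustrative, not NetBird’s actual resolver implementation.

```python
# Conceptual sketch of match-domain (split) DNS routing.
MATCH_DOMAINS = {"company.internal": "10.100.7.53"}  # suffix -> internal DNS server
SYSTEM_DNS = "192.168.1.1"                           # system or GP-pushed resolver

def resolver_for(hostname: str) -> str:
    """Return the DNS server that should answer for this hostname."""
    for suffix, server in MATCH_DOMAINS.items():
        if hostname == suffix or hostname.endswith("." + suffix):
            return server
    return SYSTEM_DNS

print(resolver_for("files.company.internal"))  # internal resolver
print(resolver_for("example.com"))             # system resolver
```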
They Do NOT Fight Over Default Routes
This is a critical clarification. NetBird, by default, does NOT add a default route. It only adds routes for:
- The NetBird overlay network (100.64.0.0/10) — peer-to-peer traffic
- Configured network routes (e.g., 10.100.7.0/24 via a routing peer)
These are specific routes that coexist peacefully with GP’s routing. The conflict in issue #5077 is NOT a routing conflict — it is a network monitor detection issue. The routes themselves do not fight.
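Longest-prefix matching is why the specific routes win even against a full-tunnel default route. A sketch using Python’s standard ipaddress module (the route labels are illustrative):

```python
import ipaddress

# Routing tables select the most specific (longest-prefix) matching route,
# so the 100.64.0.0/10 overlay route beats a GP 0.0.0.0/0 default route
# for overlay traffic.
routes = {
    "0.0.0.0/0": "GlobalProtect (full-tunnel default)",
    "100.64.0.0/10": "NetBird wt0 (overlay)",
    "10.100.7.0/24": "GlobalProtect (corporate subnet)",
}

def best_route(dst: str) -> str:
    """Return the label of the longest-prefix route containing dst."""
    ip = ipaddress.ip_address(dst)
    matches = [ipaddress.ip_network(cidr) for cidr in routes
               if ip in ipaddress.ip_network(cidr)]
    return routes[str(max(matches, key=lambda net: net.prefixlen))]

print(best_route("100.64.12.7"))  # overlay traffic -> NetBird
print(best_route("8.8.8.8"))      # internet traffic -> GP default route
```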
Question 8: Recommended Migration Sequence
Section titled “Question 8: Recommended Migration Sequence”Phase 1: Install NetBird (GP Remains Primary) — Week 1-2
- Deploy NetBird management server on Azure B2s
- Configure Entra ID integration and test with 3-5 pilot users
- Install NetBird client on pilot machines WITH the --network-monitor=false flag
- Verify NetBird overlay connectivity (ping between peers on 100.x.x.x addresses)
- GP continues handling all production VPN traffic — no changes to GP
Validation checklist:
- NetBird peers show “Connected” in dashboard
- Peer-to-peer connectivity verified (netbird status -d shows P2P connections)
- GP sessions remain stable (no SSH/RDP drops from issue #5077)
- DNS resolution works for both GP and NetBird domains
Phase 2: Configure NetBird Routing (GP Still Active) — Week 2-3
- Deploy routing peers at Honolulu and Boulder offices
- Configure Network Routes for office subnets (10.100.7.0/24, 10.15.0.0/24)
- Add routes to pilot users’ NetBird configuration
- Pilot users now have TWO paths to office resources: GP and NetBird
- Test accessing critical resources (file shares, printers, internal apps) via NetBird routes
- Compare performance and reliability between GP and NetBird paths
Key test: Disconnect GP on a pilot machine. Verify all office resources are accessible via NetBird alone. Reconnect GP. Verify both still work.
Phase 3: Expand to All Users (GP as Fallback) — Week 3-5
- Deploy NetBird to all endpoints via TacticalRMM (MSI silent install with --network-monitor=false)
- Configure all users with NetBird routes to office resources
- GP remains installed and functional on all machines
- Communicate to users: “NetBird is your new VPN. If you have issues, GlobalProtect still works as backup.”
- Monitor NetBird dashboard for connection issues, relay percentage, DNS problems
Phase 4: Disable GP (NetBird Primary) — Week 5-8
- After 2+ weeks of stable NetBird operation with all users:
- Disable GP’s auto-connect on endpoints (but do NOT uninstall)
- Remove the --network-monitor=false flag IF PR #5156 has been merged (check NetBird release notes)
- If PR not merged, keep the flag — it is safe for production
- Monitor for 2 more weeks
Phase 5: Decommission GP — Week 8+
- Uninstall GlobalProtect client from all endpoints (via TacticalRMM)
- Remove the --network-monitor=false flag if still set (no GP = no trigger)
- Update documentation and close migration project
Critical Rules
- NEVER remove GP before NetBird is verified working for all users — always have a fallback
- Always use --network-monitor=false while both VPNs are installed — this is non-negotiable
- Field workers on cellular should be in a later wave — they are most likely to need relay and most affected by network-monitor=false
- Keep PA-2020 powered on for 30 days after full migration — emergency fallback
Consolidated Recommendations
Azure Infrastructure
| Decision | Recommendation | Rationale |
|---|---|---|
| VM size | B2s (2 vCPU, 4 GB) | $15/mo more than B1ms; eliminates scaling risk |
| Database | SQLite (default) | Adequate for 100-150 peers; migrate to PostgreSQL only if symptoms appear |
| Relay | Embedded (in combined server) | Separate relay unnecessary; <15% of connections will relay |
| Region | West US 2 | Lowest latency to both Hawaii and Colorado |
| Monthly cost | ~$39/mo (pay-as-you-go) or ~$28/mo (1-year reserved) | |
| Reserved instance | 1-year commitment recommended | 28% savings; low commitment risk |
GlobalProtect Coexistence
| Decision | Recommendation | Rationale |
|---|---|---|
| Coexistence approach | Install NetBird alongside GP; both run simultaneously | Zero-risk migration path |
| Network monitor | --network-monitor=false on all clients during migration | Prevents issue #5077 (WireGuard restart) |
| Remove flag post-migration? | Yes, after GP is fully removed | No GP = no trigger for the bug |
| GP tunnel mode during migration | Split-tunnel (if configurable) | Avoids default route conflict |
| Migration waves | Pilot (5) -> Office (30) -> All office (100) -> Remote (30) -> Field (10) | Risk-graduated approach |
| Rollback time | <1 hour (just re-enable GP) | GP remains installed throughout |
Gaps & Uncertainties
| Gap | Impact | Mitigation |
|---|---|---|
| PR #5156 merge timeline unknown | MEDIUM — determines when --network-monitor=false can be removed | Flag is safe to run indefinitely; removal is an optimization, not a requirement |
| No official NetBird sizing benchmarks | LOW — community evidence is sufficient | Monitor actual resource usage post-deployment; Azure VM resize is trivial |
| Relay bandwidth under corporate workloads not benchmarked | LOW — estimated <15 Mbps peak | Monitor relay connections in NetBird dashboard; extract relay if >30% connections relay |
| SQLite events.db growth rate unknown at scale | LOW — mitigated by periodic cleanup | Implement cron job to archive events.db after 90 days |
| GP full-tunnel vs split-tunnel mode at GSISG unknown | MEDIUM — affects coexistence strategy | Ask GSISG IT admin for current GP configuration |
| Exact Azure reserved instance pricing varies | LOW — estimates within 5% | Use Azure Pricing Calculator for exact quote |
Sources & Tool Usage Log
Tools Used (20 distinct invocations)
| Tool | Count | Purpose |
|---|---|---|
| mcp__claude_ai_Tavily__tavily_search | 12 | Web search for resource usage, SQLite/PostgreSQL, relay sizing, GP coexistence, Azure pricing, network-monitor flag |
| WebFetch | 7 | GitHub issues #5077, #4488, #1473, PRs #5155, #5156, Azure VM pricing, NetBird scaling docs |
| Bash | 1 | Directory listing for output path |
Key Sources
GitHub Issues & PRs:
- #5077: WireGuard interface restart during GP connection — OPEN
- #5155: Check Windows Interfaces by Name/Description — CLOSED (not merged)
- #5156: Change Soft Interface detection to include Description — OPEN (not merged)
- #4488: Performance issue with large policy sets — OPEN (70 peers, 480 policies = 8 vCPU)
- #1473: External database for self-hosted — COMPLETED (PostgreSQL support added)
- #1824: 100 peer limit — CLOSED (fixed)
- #2522: Won’t work with other VPNs — OPEN (routing/DNS conflicts documented)
Official Documentation:
- Self-Hosting Quickstart — 1 CPU, 2 GB minimum
- Scaling Guide — No specific sizing guidance
- Understanding NAT — Relay architecture
- How NetBird Works — Architecture overview
- CLI Reference — --network-monitor flag
Azure Pricing:
- CloudPrice.net B2s — $0.0416/hr Linux
- Azure Bandwidth Pricing — First 100 GB free from North America
- Azure Managed Disks — P4 32 GB pricing
- Enforza Azure networking guide — Static IP $3.65/mo
Community:
- HN: NetBird discussion — “1k active users setup, super efficient and stable”
- Carl Pearson: Self-hosting on Hetzner — “Even the leanest VPS is enough”
- Cloudron Forum: NetBird experience — 1 vCPU, 2 GB successful
- WireGuard Performance Tuning — CPU overhead analysis