309 changes: 308 additions & 1 deletion .github/copilot-instructions.md

---

## ⚠️ MANDATORY INCIDENT INVESTIGATION CHECKLIST

**When user reports an authentication/enrollment issue, incident, or troubleshooting request:**

- [ ] **STEP 1**: Query DRI Copilot MCP tool (`Broker_DRI_Copilot`) with symptoms
- [ ] **STEP 2**: Analyze logs/evidence - extract correlation IDs, error codes, timestamps
- [ ] **STEP 3**: Search codebase for error handling and implementation
- [ ] **STEP 4**: Synthesize diagnosis, stating ONLY what evidence shows (not assumptions)
- [ ] **STEP 5**: Clearly state what evidence is MISSING

**DO NOT:**
- ❌ Skip DRI Copilot query
- ❌ Make claims without log evidence
- ❌ Diagnose based on "common issues" without seeing actual error
- ❌ Assume authentication/enrollment happened if logs don't show it

---

## 1. Repository Structure & Architecture

### 1.1 Repository Organization
- **Common Module:** Contains all IPC (Inter-Process Communication) logic. MSAL/OneAuth use Common layer to send requests to Broker over IPC
- **Broker Module:** Handles the actual authentication logic, communicates with eSTS, and returns tokens

### 1.3 DRI Copilot MCP Server Usage

**🔴 MANDATORY: Query DRI Copilot MCP tools FIRST for ANY troubleshooting or incident investigation request.**

When users ask for:
- **Troubleshooting** ANY authentication or enrollment issue
- **Investigating** customer-reported problems or IcM incidents
- **Documentation** (design docs, architecture docs, API docs)
- **Troubleshooting guides** (TSGs)
- **Past incidents** or incident resolution
- **Error explanations** or common issues
- **Onboarding information**

**⚠️ DO NOT skip this step. DO NOT assume you know the answer. ALWAYS query DRI Copilot MCP tools FIRST.**

#### Available DRI Copilot Tools:
Look for MCP tools with these patterns in their names:
1. **Broker DRI Copilot** (tools containing `Broker_DRI_Copilot`)
- Use for: Broker-related questions, PRT, device registration, brokered auth flows, TSGs, past incidents

The exact tool names will vary based on the MCP server name configured in `.vscode/mcp.json`.

#### Mandatory Workflow (NO EXCEPTIONS):
1. **FIRST**: Query DRI Copilot MCP tool (Broker_DRI_Copilot)
- Include: user symptoms, error messages, and any log evidence
- Get: TSG steps, past incidents, known issues, troubleshooting guidance
2. **SECOND**: Analyze provided logs/evidence (see Section 1.4)
- Extract: correlation IDs, error codes, timestamps
- Identify: what operations occurred, what's missing
3. **THIRD**: Search codebase for relevant implementation
- Find: error handling, code paths, known bugs
4. **FINALLY**: Synthesize diagnosis combining all three sources

#### Example Queries:
- "What is PRT?" → Query Broker DRI Copilot
- "How to troubleshoot auth_cancelled_by_sdk?" → Query both MCP servers
- "Authenticator onboarding documentation" → Query Authenticator DRI Copilot
- "Past incidents related to [issue]" → Query relevant MCP server(s)

### 1.4 Incident Investigation Guidelines (IcM/Customer-Reported Issues)

**🔴 CRITICAL: Follow this EXACT sequence for ALL incident investigations:**

**STEP 0 (MANDATORY FIRST STEP):**
- **Query DRI Copilot MCP tools** with user symptoms and error messages
- Get TSG guidance, past incidents, and known issues
- This step CANNOT be skipped

**STEP 1: Analyze Provided Evidence**
When investigating customer-reported incidents or IcM tickets, follow this **evidence-first approach**:

#### **Priority Hierarchy:**
1. **Direct Evidence from Logs/Data** (Highest Priority)
- Actual log files, stack traces, error codes, correlation IDs
- Telemetry data from Kusto (android_spans, eSTS logs)
- Concrete timestamps, device IDs, user IDs
- Network traces, API responses, HTTP status codes

2. **Code Analysis** (High Priority)
- Current implementation in the codebase
- Recent code changes or commits related to the issue
- Known bugs or limitations in the code

3. **Documentation & TSGs** (Supporting Priority)
- Use documentation to **augment and explain** what you observe in evidence
- Reference TSGs for **additional troubleshooting steps**, not as primary diagnosis
- Documentation provides context, not conclusions

#### **Critical Rules:**
- **NEVER make claims without evidence**: If logs don't show something happened, state "logs don't show X" rather than claiming X occurred
- **Distinguish observation from inference**: Clearly label when you're inferring vs. observing
- ✅ "Logs show `Number of Microsoft Accounts: 0`, suggesting no enrollment completed"
- ❌ "Enrollment failed after authentication" (when logs don't show the enrollment attempt)
- **Challenge documentation against evidence**: If documentation says X should happen, but logs show Y, trust the evidence
- **State what's missing**: If critical evidence is absent (e.g., no correlation IDs, no error codes), explicitly state what additional data is needed

#### **Investigation Workflow:**
1. **Analyze provided evidence first** (logs, errors, telemetry)
- Extract: correlation IDs, timestamps, error codes, device/user identifiers
- Identify: actual operations performed, their outcomes, any exceptions
- Note: what evidence is present and what is missing

2. **Search codebase for relevant implementation**
- Find the code paths that should have executed
- Check for known issues, error handling, logging statements

3. **Query documentation/TSGs** to augment understanding
- Use to explain error codes, suggest additional diagnostics
- Reference common patterns, but validate against actual evidence

4. **Formulate diagnosis**
- State what evidence supports each hypothesis
- Rank hypotheses by strength of evidence (not by documentation frequency)
- Clearly separate proven facts from educated guesses

5. **Recommend next steps**
- Prioritize actions that gather missing evidence
- Suggest diagnostics based on what the evidence shows, not just what's common

#### **Example - Good vs. Bad Analysis:**

**❌ Bad (Documentation-First):**
> "Based on TSG documentation, device cap is the most common cause (60% of cases), so this is likely a device cap issue."

**✅ Good (Evidence-First):**
> "Logs show `GetBrokerAccounts` returning 0 accounts, but no enrollment attempt is captured. Without correlation IDs or eSTS error codes, I cannot determine the root cause. The TSG indicates device cap is common (60%), but we need enrollment attempt logs to confirm. Next step: Collect logs with correlation IDs during reproduction."

## 2. Core Principles

* **Primary Language for New Code:** All new code and new files **must be written in Kotlin**.

## 13. Telemetry & Analytics with Azure Data Explorer (Kusto)

### 13.1 Cluster Information
### 13.1 Cluster Information for Android Broker/Common/MSAL Libraries
* **Primary Cluster:** `https://idsharedeus2.kusto.windows.net/`
* **Production Database:** `ad-accounts-android-otel`
* **Sandbox Database:** `android-broker-otel-sandbox`
- **MCP Server:** Use the `mcp_my-mcp-server_execute_query` tool to run Kusto queries from Copilot
- **Schema Discovery:** Use `mcp_my-mcp-server_get_table_schema` to explore available fields in tables

---

## 14. eSTS (Token Service) Investigations with Azure Data Explorer (Kusto)

### 14.1 Cluster Information
* **eSTS Cluster:** `https://estswus2.kusto.windows.net/`
* **Database:** `ESTS`
* **Primary Table:** `AllPerRequestTable` - Union view spanning multiple Kusto clusters (estsfrc, estssec, estsdb3, etc.)
- Contains all token service request/response data across global eSTS deployments
- Queries fan out to physical `PerRequestTableIfx` tables in each cluster
- Use `AllPerRequestTable` for comprehensive cross-cluster queries
* **Purpose:** eSTS is Microsoft's token service. Android team is a client of eSTS, so we investigate Android-related authentication requests and responses

### 14.2 Android-Specific Filtering
**ALWAYS filter by Android platform when investigating Android issues:**
```kql
AllPerRequestTable
| where env_time >= ago(7d)
| where DevicePlatformForUI == "Android" // Primary Android filter
```

**Alternative Android platform fields:**
- `DevicePlatform` - Raw device platform string
- `DevicePlatformForUI` - User-friendly platform name (preferred for filtering)
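
To confirm which platform values actually appear before filtering, a quick distinct-values check can help (a sketch; adjust the time window as needed):

```kql
// Enumerate platform values observed in the last day
AllPerRequestTable
| where env_time >= ago(1d)
| summarize request_count = count() by DevicePlatformForUI
| order by request_count desc
| take 20
```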

### 14.3 Key Fields for Android Investigations

**Request Identification:**
- `CorrelationId` - Correlates request across Android client → Broker → eSTS
- `RequestId` - Unique identifier for the eSTS request
- `env_time` - Request timestamp

**Request Type & Flow:**
- `Call` - Type of authentication call (e.g., "token", "authorize", "common/oauth2/v2.0/token")
- `IsInteractive` - Boolean indicating if user interaction was required
- `Prompt` - Prompt type (e.g., "none", "login", "consent")

**Authentication Status:**
- `Result` - Overall result status (e.g., "Success", "Failure")
- `ErrorCode` - Error code if request failed
- `ErrorNo` - Numeric error code
- `HttpStatusCode` - HTTP status code of the response
- `SubErrorCode` - Additional error details

**PRT (Primary Refresh Token) Information:**
- `PrtData` - Contains PRT-related data (check if PRT was sent in request)
- Parse this JSON field to determine if PRT was used
- Example: `| extend HasPRT = isnotempty(PrtData)`

**Device & Client Information:**
- `DeviceId` - Unique device identifier
- `ApplicationId` - Client application ID
- `UserAgent` - Browser/client user agent string

**Timing Information:**
- `ResponseTime` - Total response time in milliseconds
- `StsProcessingTime` - eSTS processing time

**User & Account Information:**
- `TenantId` - Tenant identifier
- `UserTenantId` - User's home tenant identifier
- `UserPrincipalObjectID` - User's object ID in Entra ID (unique identifier for the user)
- `AccountType` - Type of account (AAD, MSA, etc.)
- `UserType` - Type of user

### 14.4 Common Query Patterns

**Find requests by CorrelationId:**
```kql
AllPerRequestTable
| where env_time >= ago(7d)
| where DevicePlatformForUI == "Android"
| where CorrelationId == "your-correlation-id-here"
| project env_time, CorrelationId, Call, Result, ErrorCode, IsInteractive, PrtData, ResponseTime
```

**Check if PRT was used in requests:**
```kql
AllPerRequestTable
| where env_time >= ago(7d)
| where DevicePlatformForUI == "Android"
| extend HasPRT = isnotempty(PrtData)
| summarize
    total_requests = count(),
    success_rate = round(100.0 * countif(Result == "Success") / count(), 2)
    by HasPRT
```

**Analyze error patterns:**
```kql
AllPerRequestTable
| where env_time >= ago(7d)
| where DevicePlatformForUI == "Android"
| where Result != "Success"
| summarize error_count = count() by ErrorCode, SubErrorCode, Call
| order by error_count desc
| take 20
```

**Interactive vs Silent requests:**
```kql
AllPerRequestTable
| where env_time >= ago(7d)
| where DevicePlatformForUI == "Android"
| summarize
    request_count = count(),
    success_rate = round(100.0 * countif(Result == "Success") / count(), 2),
    avg_response_time = avg(ResponseTime)
    by IsInteractive
```

**Requests by application:**
```kql
AllPerRequestTable
| where env_time >= ago(7d)
| where DevicePlatformForUI == "Android"
| summarize
    request_count = count(),
    error_count = countif(Result != "Success"),
    avg_response_time = avg(ResponseTime)
    by ApplicationId
| extend error_rate = round(100.0 * error_count / request_count, 2)
| order by request_count desc
```

**Analyze users and devices:**
```kql
AllPerRequestTable
| where env_time >= ago(7d)
| where DevicePlatformForUI == "Android"
| where ApplicationId == "your-app-id-here"
| summarize
    request_count = count(),
    unique_devices = dcount(DeviceId),
    first_seen = min(env_time),
    last_seen = max(env_time)
    by UserPrincipalObjectID, PUID, TenantId
| order by request_count desc
```

### 14.5 Correlating Android Spans with eSTS Requests

**To trace a complete flow (Android client → Broker → eSTS):**

1. **Start with Android span:**
```kql
// In android-broker-otel-sandbox or ad-accounts-android-otel database
android_spans
| where EventInfo_Time >= ago(7d)
| where span_name == "AcquireTokenInteractive"
| where error_code == "some_error"
| project correlation_id, span_id, EventInfo_Time, error_code
| take 100
```

2. **Find corresponding eSTS requests:**
```kql
// In ESTS database
AllPerRequestTable
| where env_time >= ago(7d)
| where DevicePlatformForUI == "Android"
| where CorrelationId in ("correlation-id-1", "correlation-id-2", ...) // From step 1
| project env_time, CorrelationId, Call, Result, ErrorCode, PrtData, ResponseTime
```

### 14.6 Query Optimization Tips

1. **Always filter by time first** - Use `| where env_time >= ago(Xd)` at the start
2. **Filter by Android platform early** - Add `| where DevicePlatformForUI == "Android"` immediately after time filter
3. **Use `take` for exploration** - Limit results with `| take 1000` to avoid timeouts
4. **Project early** - Select only needed columns with `| project` to reduce data transfer
5. **Check field population** - Use `isnotempty()` before parsing fields like `PrtData`
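
The five tips above combine into a template like the following (a sketch, not a prescribed query):

```kql
// Tips 1-5 applied in one exploratory query
AllPerRequestTable
| where env_time >= ago(1d)                  // 1. time filter first
| where DevicePlatformForUI == "Android"     // 2. platform filter early
| where isnotempty(PrtData)                  // 5. check field population before parsing
| project env_time, CorrelationId, Call, Result, ErrorCode, PrtData  // 4. project early
| take 1000                                  // 3. limit results for exploration
```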

### 14.7 Important Notes

- **Data Retention:** Check with eSTS team for current retention policy
- **Sensitive Data:** Many fields are hashed or scrubbed for privacy (e.g., PUID, UserNameHash)
- **Cross-Cluster Queries:** Cannot directly join Android telemetry cluster with eSTS cluster; must correlate via CorrelationId manually
- **PrtData Parsing:** `PrtData` is a JSON string; use `parse_json()` or `extend` to extract specific fields
- **MCP Server:** Use `mcp_my-mcp-server_execute_query` with cluster URL `https://estswus2.kusto.windows.net` and database `ESTS`
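
As the `PrtData` note above mentions, the field is a JSON string. A minimal parsing sketch follows; the inner field name `prt_status` is a hypothetical placeholder — discover the real field names with `mcp_my-mcp-server_get_table_schema` or by inspecting a sample row:

```kql
// Parse the PrtData JSON string; the extracted field name is illustrative only
AllPerRequestTable
| where env_time >= ago(1d)
| where DevicePlatformForUI == "Android"
| where isnotempty(PrtData)
| extend PrtJson = parse_json(PrtData)
| extend PrtStatus = tostring(PrtJson.prt_status) // hypothetical field name
| take 100
```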

---