WebsiteTemplate/docs/ANALYTICS_ACCURACY.md
2026-01-25 11:33:37 -04:00

# Analytics Accuracy Guide
## How to Verify Analytics Accuracy
Run the verification script to check for discrepancies:
```bash
php /var/www/verify-analytics.php [YYYY-MM-DD]
```
If no date is provided, it checks today's data.
## Current Accuracy Issues Found
### 1. **Returning Visitor Count Bug** ⚠️
The summary shows returning visitor counts that do not match the unique returning visitors counted by the verification script; the summary logic appears to be flawed.
**Impact**: Returning visitor numbers are inflated.
### 2. **RSS Click Tracking**
RSS clicks are tracked in two ways:
- Button clicks on the page (tracked via JavaScript)
- Actual RSS feed fetches (tracked via PHP in `feed.php`)
**Impact**: RSS numbers may be double-counted or inconsistent.
### 3. **No Bot Filtering**
Bot traffic (search engines, crawlers) is currently counted as regular visitors.
**Impact**: Numbers may be inflated by 10-30% depending on site popularity.
### 4. **Ad Blockers**
Users with ad blockers may block the analytics script entirely.
**Impact**: Numbers may be deflated by 5-15% (depending on user base).
### 5. **Self-Visits**
Your own visits are not filtered out.
**Impact**: Development/testing visits inflate numbers.
### 6. **Duplicate Pageviews**
A pageview from the same visitor on the same page within 5 seconds of the previous one is treated as a potential duplicate.
**Impact**: Rapid navigation or page refreshes create duplicates.
### 7. **New vs Returning Logic**
The new-vs-returning check currently only looks at data from the same day. A visitor who came yesterday and returns today is counted as "new" again.
**Impact**: Returning visitor counts are inaccurate across days.
## Factors Affecting Accuracy
### ✅ What IS Tracked Accurately:
- Pageview timestamps (hourly breakdown is recalculated from raw data)
- Share counts (when JavaScript executes)
- Reaction counts (stored separately, very accurate)
### ⚠️ What May Be Inaccurate:
- **Total visits**: May include bots, duplicates, self-visits
- **New vs Returning**: Only accurate within same day
- **RSS clicks**: May have double-counting issues
- **Unique visitors**: Uses localStorage, can be cleared/blocked
## Recommendations to Improve Accuracy
### 1. **Filter Bot Traffic**
Add bot detection in `track.php`:
```php
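// Check the user agent for common bot markers (a heuristic, not exhaustive)
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
// preg_match() returns 1 on a match, 0 otherwise, or false on error
$isBot = preg_match('/bot|crawler|spider|scraper/i', $ua) === 1;
if ($isBot) {
    // Skip tracking or mark as bot
}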
```
### 2. **Filter Self-Visits**
Add your IP(s) to a blocklist in `track.php`:
```php
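// Replace the placeholders with your own IP address(es)
$yourIPs = ['YOUR_IP_HERE', 'ANOTHER_IP'];
// Note: behind a proxy or CDN, REMOTE_ADDR is the proxy's address
if (in_array($_SERVER['REMOTE_ADDR'] ?? '', $yourIPs, true)) {
    // Skip tracking
}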
```
### 3. **Fix Returning Visitor Logic**
Store visitor history across days, not just within the same day.
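One possible shape for this, as a minimal sketch: the `classifyVisitor()` helper and the `$firstSeen` map (visitor ID to first-seen date) are hypothetical names, and how the project actually persists visitor data is not shown here.

```php
<?php
// Hypothetical helper: classifies a visitor for the given day using a
// persistent first-seen map instead of only that day's data.
// $firstSeen maps visitor ID => first-seen date (YYYY-MM-DD); this storage
// format is an assumption, not the project's actual schema.
function classifyVisitor(array &$firstSeen, string $visitorId, string $today): string
{
    if (!isset($firstSeen[$visitorId])) {
        // Never seen on any day: remember when this visitor first appeared.
        $firstSeen[$visitorId] = $today;
    }
    // ISO dates sort lexicographically, so strcmp() orders them correctly.
    return strcmp($firstSeen[$visitorId], $today) < 0 ? 'returning' : 'new';
}

$firstSeen = ['abc123' => '2025-12-27'];
echo classifyVisitor($firstSeen, 'abc123', '2025-12-28'), "\n"; // returning
echo classifyVisitor($firstSeen, 'xyz789', '2025-12-28'), "\n"; // new
echo classifyVisitor($firstSeen, 'xyz789', '2025-12-29'), "\n"; // returning
```

In this sketch a visitor stays "new" for the whole day on which they were first seen, which differs from the current same-day definition; whether that is the desired semantics is a design choice.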
### 4. **Deduplicate Rapid Pageviews**
Add a cooldown period (e.g., ignore a pageview when the same visitor requests the same page again within 10 seconds).
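A cooldown check could look roughly like the sketch below. The `isDuplicatePageview()` helper and the `$lastSeen` map are hypothetical; in `track.php` the map would need to be backed by whatever storage the tracker already uses.

```php
<?php
// Hypothetical helper: returns true when a pageview repeats the same visitor
// and page within the cooldown window. $lastSeen maps "visitor|page" to the
// Unix timestamp of the last counted pageview (assumed storage).
function isDuplicatePageview(
    array &$lastSeen,
    string $visitorId,
    string $page,
    int $now,
    int $cooldown = 10
): bool {
    $key = $visitorId . '|' . $page;
    if (isset($lastSeen[$key]) && ($now - $lastSeen[$key]) < $cooldown) {
        return true; // Inside the cooldown: ignore this pageview.
    }
    $lastSeen[$key] = $now; // Count it and restart the cooldown.
    return false;
}

$lastSeen = [];
var_dump(isDuplicatePageview($lastSeen, 'abc123', '/post-1', 1000)); // bool(false)
var_dump(isDuplicatePageview($lastSeen, 'abc123', '/post-1', 1004)); // bool(true)
var_dump(isDuplicatePageview($lastSeen, 'abc123', '/post-1', 1015)); // bool(false)
```

The cooldown is measured from the last counted pageview, so a burst of refreshes collapses into a single entry rather than extending the window indefinitely.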
### 5. **Separate RSS Tracking**
Distinguish between:
- RSS button clicks (user intent)
- RSS feed fetches (automatic, may be bots)
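As a rough sketch of that separation, the two event types can be kept in distinct counters. The `recordRssEvent()` helper, the `'button'`/`'fetch'` source labels, and the `rss_*` keys are assumptions for illustration, not the project's existing names.

```php
<?php
// Hypothetical sketch: keep RSS button clicks and feed fetches in separate
// counters so the two are never summed by accident.
function recordRssEvent(array &$stats, string $source): void
{
    // Assumed sources: 'button' (JavaScript click) and 'fetch' (feed.php).
    if (!in_array($source, ['button', 'fetch'], true)) {
        throw new InvalidArgumentException("Unknown RSS source: $source");
    }
    $key = 'rss_' . $source;
    $stats[$key] = ($stats[$key] ?? 0) + 1;
}

$stats = [];
recordRssEvent($stats, 'button'); // e.g. called from the click endpoint
recordRssEvent($stats, 'fetch');  // e.g. called from feed.php
recordRssEvent($stats, 'fetch');
print_r($stats); // rss_button => 1, rss_fetch => 2
```

Reporting the two counters side by side, rather than as one "RSS clicks" total, makes it easier to spot feed readers or bots polling the feed.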
## Understanding Your Numbers
### Realistic Accuracy Range
- **Pageviews**: ±15-25% (due to bots, ad blockers, duplicates)
- **Unique Visitors**: ±20-30% (localStorage can be cleared/blocked)
- **Shares**: ±5% (very accurate, requires JavaScript)
- **Reactions**: ±1% (very accurate, stored server-side)
### What the Numbers Mean
- **Total Visits**: All page loads, including bots and duplicates
- **New Visitors**: Visitors seen for the first time today (not first-time visitors overall)
- **Returning Visitors**: Visitors who had already visited earlier the same day (visits on previous days are not considered)
- **Hourly Breakdown**: Accurate (recalculated from timestamps)
## Best Practices
1. **Run verification script regularly** to catch discrepancies
2. **Focus on trends** rather than absolute numbers
3. **Compare with server logs** for validation
4. **Filter your own IP** for more accurate numbers
5. **Monitor for anomalies** (sudden spikes may be bots)
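For the server-log comparison, a small sketch along these lines could produce a baseline pageview count. It assumes the web server writes the common/combined access-log format; the `countLogPageviews()` helper and the static-asset filter are hypothetical, and the date must be given in the log's own format (e.g. `28/Dec/2025`), not `YYYY-MM-DD`.

```php
<?php
// Hypothetical sketch: count successful GET requests for one day in access-log
// lines (common/combined log format assumed), skipping obvious static assets.
function countLogPageviews(iterable $lines, string $day): int
{
    // Matches e.g.: [28/Dec/2025:10:15:32 +0000] "GET /post-1 HTTP/1.1" 200
    $pattern = '~\[(\d{2}/\w{3}/\d{4}):[^\]]*\] "GET (\S+) [^"]*" 200 ~';
    $count = 0;
    foreach ($lines as $line) {
        if (!preg_match($pattern, $line, $m) || $m[1] !== $day) {
            continue; // Not a successful GET on the requested day.
        }
        if (preg_match('~\.(css|js|png|jpe?g|gif|svg|ico|woff2?)(\?|$)~i', $m[2])) {
            continue; // Static asset, not a pageview.
        }
        $count++;
    }
    return $count;
}

$sample = [
    '203.0.113.5 - - [28/Dec/2025:10:15:32 +0000] "GET /post-1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [28/Dec/2025:10:15:33 +0000] "GET /style.css HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [28/Dec/2025:11:02:10 +0000] "GET /missing HTTP/1.1" 404 310 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [27/Dec/2025:23:59:59 +0000] "GET /post-2 HTTP/1.1" 200 4800 "-" "Mozilla/5.0"',
];
echo countLogPageviews($sample, '28/Dec/2025'), "\n"; // 1
```

Server logs include bots and visitors with ad blockers, so this count will usually be higher than the JavaScript-based numbers; a large gap in the other direction is a stronger sign of a tracking bug.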
## Quick Accuracy Check
```bash
# Check today's data
php /var/www/verify-analytics.php
# Check specific date
php /var/www/verify-analytics.php 2025-12-28
# Look for:
# - Discrepancies between summary and raw data
# - High bot counts
# - Duplicate pageviews
# - Rapid-fire visits
```
## Expected Accuracy
For a typical personal blog:
- **Pageviews**: 70-85% accurate (after accounting for bots/ad blockers)
- **Unique Visitors**: 60-75% accurate (localStorage limitations)
- **Engagement** (shares/reactions): 95%+ accurate
The analytics are **good enough for trends and general insights**, but don't rely on exact numbers for critical decisions.