Sampling
Sampling reduces log volume by keeping only a percentage of matching events. This is useful for cost optimization on high-volume, low-value log sources while maintaining statistical visibility into patterns and trends.
Sampling is configured as a pipeline step. Navigate to Pipelines → Edit your pipeline → Add a Sample step.
When to Use Sampling
Sampling is ideal when you need to:
- Reduce storage costs for high-volume sources without losing visibility
- Maintain statistical patterns while reducing event count
- Optimize performance for sources generating millions of events daily
- Keep representative data for trend analysis and capacity planning
Not for Security-Critical Events: Don’t sample events that require full fidelity for security investigations (authentication failures, privilege escalation, data access). Use sampling only for operational or informational logs.
How Sampling Works
When you add a Sample step to a pipeline:
- Events matching the step’s preconditions are evaluated
- A percentage of matching events (e.g., 10%) are kept
- The remaining events are discarded before reaching storage
- Non-matching events pass through unaffected (a sketch of this logic follows below)
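As a rough illustration of this flow (not the product's actual implementation), the per-event decision can be sketched in Python. The field name, regex precondition, and 10% rate are assumptions borrowed from the configuration example in the next section.

```python
import random
import re

def keep_event(event: dict, rate: float, field: str, pattern: str) -> bool:
    """Return True to keep the event, False to discard it before storage."""
    value = str(event.get(field, ""))
    # Non-matching events pass through unaffected.
    if not re.match(pattern, value):
        return True
    # Matching events are kept with probability `rate` (e.g., 0.10 for 10%).
    return random.random() < rate

# Example: keep roughly 10% of events whose eventName starts with "Get".
keep_event({"eventName": "GetObject"}, rate=0.10, field="eventName", pattern=r"^Get.*")
```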
Configuration
| Setting | Description | Example |
|---|---|---|
| Sample Rate | Percentage of events to keep (1-99%) | 10% |
| Preconditions | Which events the sampling applies to | eventName regex: ^Get.* |
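For concreteness, the example column above could be represented as a small configuration structure like the one below. The shape and key names are purely illustrative; in practice these values are entered in the pipeline editor rather than written by hand.

```python
# Hypothetical representation of the example settings above (illustrative only).
sample_step = {
    "sample_rate": 0.10,  # Sample Rate: keep 10% of matching events
    "preconditions": [
        {"field": "eventName", "regex": r"^Get.*"},  # which events sampling applies to
    ],
}
```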
Use Cases
Sample AWS S3 Data Events
S3 data events (GetObject, PutObject, ListObjects) generate extremely high volumes in active environments. Sample 10% to reduce costs while maintaining visibility into access patterns.
```json
{
  "eventName": "GetObject",
  "eventSource": "s3.amazonaws.com",
  "awsRegion": "us-east-1",
  "userIdentity": {
    "arn": "arn:aws:iam::123456789012:role/data-pipeline"
  },
  "requestParameters": {
    "bucketName": "company-data-lake",
    "key": "logs/2024/01/15/events.json"
  }
}
```
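To make the cost effect concrete, here is some back-of-the-envelope arithmetic. The daily volume and average event size are invented for illustration; only the 10% rate comes from the example above.

```python
# Illustrative arithmetic only; the volume and size figures are assumptions.
daily_get_events = 50_000_000   # hypothetical GetObject events per day
avg_event_bytes = 1_500         # hypothetical average size per event
sample_rate = 0.10              # keep 10% of matching events

stored_gb = daily_get_events * sample_rate * avg_event_bytes / 1e9
print(f"Stored per day: {stored_gb:.1f} GB (a {1 - sample_rate:.0%} reduction)")
# Stored per day: 7.5 GB (a 90% reduction)
```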
Sample Cloudflare HTTP Requests
CDN and web traffic logs can generate millions of events daily. Sample successful requests while keeping all errors at full fidelity for debugging.
```json
{
  "ClientRequestHost": "api.company.com",
  "ClientRequestMethod": "GET",
  "ClientRequestURI": "/v1/users",
  "EdgeResponseStatus": 200,
  "ClientIP": "203.0.113.50",
  "RayID": "8a1b2c3d4e5f6a7b"
}
```
By using a precondition that only matches successful (200) responses, all error responses (4xx, 5xx) are kept at full fidelity for debugging and security analysis.
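A sketch of that behavior, assuming a hypothetical 10% rate on successful requests (the rate is an assumption, and the real precondition is configured on the Sample step rather than in code):

```python
import random

def keep_cloudflare_request(event: dict, rate: float = 0.10) -> bool:
    # Error responses do not match the precondition, so they pass through unaffected.
    if event.get("EdgeResponseStatus") != 200:
        return True
    # Successful requests match the precondition and are sampled.
    return random.random() < rate
```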
Sample VPC Flow Logs
VPC Flow Logs capture all network traffic metadata, generating massive volumes. Sample accepted traffic while keeping all rejected connections for security analysis.
```json
{
  "version": 2,
  "account-id": "123456789012",
  "interface-id": "eni-0123456789abcdef0",
  "srcaddr": "10.0.1.100",
  "dstaddr": "10.0.2.50",
  "srcport": 443,
  "dstport": 49152,
  "protocol": 6,
  "packets": 10,
  "bytes": 840,
  "action": "ACCEPT"
}
```
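The same pattern applies here, sketched below with an assumed 10% rate on accepted traffic. Rejected connections never match the precondition, so they are always kept at full fidelity.

```python
import random

def keep_flow_record(record: dict, rate: float = 0.10) -> bool:
    # REJECT records do not match the precondition and pass through unaffected.
    if record.get("action") != "ACCEPT":
        return True
    # Accepted traffic is sampled; the 10% rate is illustrative.
    return random.random() < rate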
Best Practices
- Start with higher sample rates (e.g., 25%) and reduce gradually based on your analysis needs
- Use preconditions to sample only specific event types, keeping security-relevant events at full fidelity
- Combine with enrichments to tag events before sampling decisions
- Monitor your dashboards after enabling sampling to ensure statistical patterns remain visible (see the sketch after this list)
- Document your sampling strategy so analysts know which data is sampled
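When checking dashboards built over sampled data, observed counts can be scaled back up by dividing by the sample rate, since sampling is random. The snippet below is a generic illustration with a made-up observed count, not a product feature.

```python
# Estimate the pre-sampling event count from a count observed over sampled data.
sample_rate = 0.10        # the configured Sample Rate (10%)
observed_count = 12_480   # hypothetical count from a dashboard over sampled events

estimated_original = observed_count / sample_rate
print(f"Estimated pre-sampling count: {estimated_original:,.0f}")  # 124,800
```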
Sample vs Filter vs Drop
| Action | What Happens | When to Use |
|---|---|---|
| Sample | Randomly keeps a configured percentage of matching events | High-volume operational logs where statistical patterns are sufficient |
| Filter | Archives events (still searchable) | Events you rarely need but must retain for compliance |
| Drop | Permanently removes events | Events with zero value that should never be stored |