AWS Security Monitoring, Logging, and Alerting
In terms of AWS security, first the good news: Amazon Web Services offers an impressive collection of security monitoring and logging capabilities. Now the bad news: these tools are entirely too fragmented and complex, with a range of little-known gaps and complications which can be opaque even to experienced cloud security professionals. The inspiration for this post is actually a series of misunderstandings I had myself about how things worked, despite years of AWS security experience and testing.
With that preface, let’s dig into the details. Our goal is to lay out the different AWS security monitoring and logging sources, how to collect logs from them, and how to select the most appropriate collection technique. If you see any errors, definitely let me know, and I will correct them as quickly as possible.
What are the different AWS log mechanisms/repositories, and the fast vs. slow paths?
AWS offers a myriad of logs from various services, but they always end up in one of three different mechanisms, which correlate with two different storage repositories and a single bus. Okay, if AWS hadn’t already named a product EventBus I would say “a single event bus”, but they did so I can’t… but that’s essentially what it is. I know those sentences are confusing so let’s just dive in:
- Amazon S3: Multiple services (which we will list below) save logs into S3 buckets. Many services save a compressed (GZIP) file every 5 minutes, each containing data collected during the previous monitoring window.
- CloudWatch Log Streams: CloudWatch is the native AWS monitoring service. It uses a log stream: a stream of log entries you can export, subscribe to, or view in the console. These log entries stay in CloudWatch but are accessible externally using a CloudWatch Logs Subscription (see the sketch after this list).
- CloudWatch Events: CloudWatch Events are very different from Log Streams. A stream is a stored and exportable collection of events, while a CloudWatch Event is a single event, only accessible by setting a CloudWatch Rule to send it someplace else. Events are ephemeral, not stored or saved. There is no concept of event history unless you write rules to save events someplace.
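To make the subscription mechanism concrete, here is a minimal sketch (Python with boto3) that subscribes a log group to a Kinesis Data Stream. The log group, stream, and role names are hypothetical placeholders; you need an IAM role that CloudWatch Logs can assume to write to the stream.

```python
import boto3

logs = boto3.client('logs')

# Subscribe a log group to a Kinesis Data Stream. An empty filter
# pattern forwards every log event in the group.
logs.put_subscription_filter(
    logGroupName='/aws/cloudtrail/my-trail',        # hypothetical log group
    filterName='forward-to-kinesis',
    filterPattern='',                               # empty = match everything
    destinationArn='arn:aws:kinesis:us-west-2:111122223333:stream/my-stream',
    roleArn='arn:aws:iam::111122223333:role/CWLtoKinesisRole'
)
```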
Here is where things get complicated. Each mechanism saves (or exposes, in the case of Events) data in a different timeframe, which varies not only by mechanism but also by service. In terms of mechanisms:
- S3 has the longest delay, typically 10–20 minutes after the event occurred.
- CloudWatch Log Streams have variable timing, which is often close to the same as S3 but more variable — based not only on service, but also on other criteria such as log volume, which you can’t really predict. I’m running off rumors and some testing here — there is little to no documentation on CloudWatch Log Stream timing.
- CloudWatch Events are nearly instantaneous, and can generate alerts within seconds.
This is why I classify S3 and CloudWatch Log Streams as “slow path” monitoring: delays of 10–20 minutes between activity and a saved event or alert. CloudWatch Events is best for “fast path” monitoring, with delays of only a few seconds, depending on the service. (Note that actual analysis and alerting on those events takes more like 1–2 minutes total).
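As a fast-path illustration, here is a minimal sketch of a CloudWatch Rule that matches GuardDuty findings as they fire and sends them to an SNS topic for alerting. The rule name and topic ARN are hypothetical, and the topic's access policy must allow CloudWatch Events to publish to it.

```python
import json
import boto3

events = boto3.client('events')

# Match every GuardDuty finding as it is generated (fast path).
events.put_rule(
    Name='guardduty-findings',                      # hypothetical rule name
    EventPattern=json.dumps({
        'source': ['aws.guardduty'],
        'detail-type': ['GuardDuty Finding']
    }),
    State='ENABLED'
)

# Send matching events to an SNS topic. The topic policy must allow
# events.amazonaws.com to publish to it.
events.put_targets(
    Rule='guardduty-findings',
    Targets=[{
        'Id': 'sns-alerts',
        'Arn': 'arn:aws:sns:us-west-2:111122223333:security-alerts'  # hypothetical
    }]
)
```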
Why would I ever use slow path monitoring?
There are two good reasons to use slow path monitoring — often alongside fast path monitoring:
- To collect more activity; not all activity from all services is available in CloudWatch Events. For example, CloudTrail only exposes write activity to CloudWatch Events, so you need something else to see read API calls, such as requests to read database tables or gather instance details. Other services and data types, including VPC Flow Logs, simply aren’t available in CloudWatch Events.
- Easier storage; CloudWatch Events are not stored unless you create rules to save them to storage. Many services default to either S3 or CloudWatch Logs (or both), so storing each CloudWatch Event would create duplicates. For example, CloudTrail always saves to S3 and/or CloudWatch Logs, so you might as well use that data for long-term access or other scenarios beyond rapid alerting.
What are the recommended security activity sources and which mechanisms do they use?
There are multiple sources for security-related activity in AWS. This table is fairly complete but may lack some service-specific logs from services I encounter less frequently. All these services use the three mechanisms we just covered:
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
CloudTrail | Nearly all API calls, which includes console activity and AWS internal activity on your resources | High | Yes | Yes | Write activity only, and only when CloudWatch Logs enabled |
Recommendation: Use S3 and CloudWatch for storage/collection, and CloudWatch Events for alerting.
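If you are wiring this up yourself, a minimal boto3 sketch might look like the following. The trail, bucket, log group, and role names are hypothetical, and the bucket needs a policy allowing CloudTrail to write to it.

```python
import boto3

cloudtrail = boto3.client('cloudtrail')

# Create an all-region trail that writes to S3 (slow path storage)
# and to CloudWatch Logs (which enables log subscriptions).
cloudtrail.create_trail(
    Name='org-security-trail',                      # hypothetical trail name
    S3BucketName='my-cloudtrail-bucket',            # bucket policy must allow CloudTrail
    IsMultiRegionTrail=True,
    CloudWatchLogsLogGroupArn='arn:aws:logs:us-west-2:111122223333:log-group:cloudtrail:*',
    CloudWatchLogsRoleArn='arn:aws:iam::111122223333:role/CloudTrailToCWL'
)
cloudtrail.start_logging(Name='org-security-trail')
```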
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
Config | Configuration, relationships, and state changes of resources, including a history of configurations. Also supports rules for auditing, compliance, and activity. | High (with a cost caveat) | Yes | No | Yes |
Recommendation: Config also sends SNS notifications. Config offers a lot of value but can be very expensive, even with the recent pricing changes. Thus we have to fall back on threat modeling and may recommend creating a custom configuration recorder to better manage costs while still collecting required data. Config rules are excellent for compliance status and are essential to Security Hub, but you may not need them if you use third-party tools (or you will use a subset of Config rules in combination with a third-party tool).
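Here is a hedged sketch of that custom configuration recorder approach, recording only a subset of resource types to control costs. The role name and the resource types shown are illustrative choices, not a recommendation list.

```python
import boto3

config = boto3.client('config')

# Record only the resource types you care about instead of everything,
# to keep Config costs under control.
config.put_configuration_recorder(
    ConfigurationRecorder={
        'name': 'default',
        'roleARN': 'arn:aws:iam::111122223333:role/ConfigRecorderRole',  # hypothetical
        'recordingGroup': {
            'allSupported': False,
            'includeGlobalResourceTypes': False,
            'resourceTypes': [
                'AWS::EC2::SecurityGroup',
                'AWS::IAM::Role',
                'AWS::S3::Bucket'
            ]
        }
    }
)
```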
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
GuardDuty | AWS-generated threat intelligence. Works best when other services like CloudTrail and VPC Flow Logs are enabled | High | No | No | Yes |
Recommendation: Recommended for production accounts; perform a threat model to decide whether you need it for development accounts.
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
VPC Flow Logs | “Netflow” like activity, including source and destination traffic patterns | Medium to High | Yes | Yes | No |
Recommendation: Use for production networks, especially “lift and shift” deployments where the VPC configuration is not modernized. The data is only useful if you have network flow analysis capabilities, which are built into many tools, including GuardDuty.
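Enabling flow logs is a single call; here is a minimal sketch sending them to S3 (the slow path). The VPC ID and bucket ARN are placeholders.

```python
import boto3

ec2 = boto3.client('ec2')

# Capture all accepted and rejected traffic for a VPC and deliver
# the flow logs to an S3 bucket (slow path).
ec2.create_flow_logs(
    ResourceIds=['vpc-0abc1234'],                   # hypothetical VPC ID
    ResourceType='VPC',
    TrafficType='ALL',
    LogDestinationType='s3',
    LogDestination='arn:aws:s3:::my-flow-log-bucket'
)
```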
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
Security Hub | Centralized security assessments and alerts, including from third-party services | Medium to High, depending on services enabled | No | No | Yes |
Recommendation: Security Hub is new, but it is the future of AWS security efforts. We are currently integrating it since it provides a good dashboard, but the actual log/alert feeds may be of lower value if you already collect them directly from the supported services.
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
Inspector | Vulnerability assessment (host and some network) | Medium | No | Yes | No |
Recommendation: Not needed for immutable deployments or if you use a third-party vulnerability assessment tool.
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
Macie | DLP/content scans of S3 | Medium to High | No | No | Yes |
Recommendation: Macie can be expensive, but it is one of your only real content-aware data protection options for S3. You will need to perform a threat model to figure out whether Macie is a fit; it isn’t a no-brainer like CloudTrail.
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
Trusted Advisor | A collection of overall assessments of your account, with security, cost, and operations recommendations. Depth of checks depends on your support plan. | Medium | No | No | Yes |
Recommendation: Always worth collecting these events, especially if you have a support plan with the extended checks (included in most plans beyond Standard). Its scheduled scans will identify findings such as public S3 buckets.
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
S3/Load Balancer (ELB/ALB) Access Logs | “Best effort” logs for S3 and load balancer access. | Medium | Yes | No | No |
Recommendation: Enable for production workloads. This is the same as any other load balancer or storage access logging, but remember that AWS does not guarantee they are complete (lack of an entry does not prove lack of access).
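For S3 specifically, turning on access logging looks like this minimal sketch. The bucket names are placeholders, and the target bucket must grant write access to the S3 log delivery group.

```python
import boto3

s3 = boto3.client('s3')

# Enable server access logging for a bucket. The target bucket must
# grant write access to the S3 log delivery group.
s3.put_bucket_logging(
    Bucket='my-production-bucket',                  # hypothetical
    BucketLoggingStatus={
        'LoggingEnabled': {
            'TargetBucket': 'my-access-log-bucket', # hypothetical
            'TargetPrefix': 'logs/my-production-bucket/'
        }
    }
)
```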
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
API Gateway Execution and Access Logs | API Gateway activity, including failed executions. Access logs add who accessed the API and how. | Medium | No | Yes | No |
Recommendation: As with any access logging, the value depends on what the service is used for and whether you build monitoring and alerting on these logs.
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
S3 Object Actions and Lambda Function Invocations | By default CloudTrail collects bucket-level API calls, but not object-level calls; you can enable object-level logging for all objects or only selected objects in CloudTrail. Similarly, Lambda management activity is recorded by default, but not function invocations (triggering a function), which can also be enabled. | Medium to High | Yes | Yes | Yes |
Recommendation: The big concern here is cost: a large amount of object-level activity can swamp CloudTrail. The same is true for Lambda invocations, which can add thousands to even millions of records to your log files. This is a cost vs. security decision which isn’t always easy; threat modeling is, again, your friend.
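To scope data events to just the resources your threat model flags, here is a sketch using CloudTrail event selectors. The trail and bucket names are placeholders; scoping to a single bucket (rather than all objects) is one way to manage cost.

```python
import boto3

cloudtrail = boto3.client('cloudtrail')

# Add data events for one S3 bucket and all Lambda invocations to an
# existing trail. Narrow the Values lists to control cost.
cloudtrail.put_event_selectors(
    TrailName='org-security-trail',                 # hypothetical trail
    EventSelectors=[{
        'ReadWriteType': 'All',
        'IncludeManagementEvents': True,
        'DataResources': [
            {'Type': 'AWS::S3::Object',
             'Values': ['arn:aws:s3:::my-sensitive-bucket/']},  # one bucket only
            {'Type': 'AWS::Lambda::Function',
             'Values': ['arn:aws:lambda']}          # all functions
        ]
    }]
)
```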
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
Lambda Functions, RDS, etc. | Various resources and services save their own log files, with high variability in content. For example, print statements in Lambda functions are saved to a CloudWatch Log Stream dedicated to the function. | Low to Medium | Sometimes | Yes | Only if manually created |
Recommendation: Treat these like any other application or database logs.
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
Route53 | DNS activity for zones/destinations managed by Route53 | Medium | No | Yes, but only in us-east-1 | No |
Recommendation: This requires you to forward the logs from us-east-1, since that is the only place they are saved. This may be an issue for non-US users.
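Here is a minimal sketch of enabling query logging, with placeholder zone and log group names; note the destination log group must live in us-east-1.

```python
import boto3

route53 = boto3.client('route53')

# Route53 query logs can only be delivered to a CloudWatch Logs log
# group in us-east-1, regardless of where you operate.
route53.create_query_logging_config(
    HostedZoneId='Z0ABCDEFGHIJKL',                  # hypothetical zone ID
    CloudWatchLogsLogGroupArn=(
        'arn:aws:logs:us-east-1:111122223333:'
        'log-group:/aws/route53/example.com'        # must be in us-east-1
    )
)
```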
SERVICE | WHAT IT COLLECTS | SECURITY VALUE | S3 | CloudWatch Logs | CloudWatch Events |
---|---|---|---|---|---|
AWS WAF | Web ACL traffic information for the configured web ACL (which means you have to configure each one separately, there isn’t a single WAF logging switch) | Medium | No | No | No |
Recommendation: This one is unusual: it only delivers logs to Kinesis Data Firehose.
How do AWS Regions affect monitoring?
Hmm. Not well?
AWS Regions are an extremely valuable tool for segregation and blast radius control. You want regions to have only limited connections between each other, always under customer control, so you can maintain compliance and segregation. The problem is that AWS offers few mechanisms for managing things customers want to span regions. Monitoring is the biggest of these in my book, especially since IAM is already cross-region.
Here is what you need to know:
- CloudTrail is the only multi-region service we listed. We will discuss that in a moment because undocumented nuances can have massive impact.
- CloudWatch Log Subscriptions are the mechanism to send CloudWatch Log Streams someplace else. These can work cross-region, but you may need to use the API rather than the console, depending on destination.
- CloudWatch Rules can send cross-region by setting two different kinds of targets. They can send to an SNS Topic which then forwards to wherever you want the event. The other option is to send to a Lambda function and code up a custom forwarder. Don’t worry — it’s fairly easy and we offer sample code below.
How do I work with CloudWatch Events if they aren’t stored and are region restricted?
Moving CloudWatch Events across regions is the single most frustrating aspect of collecting activity in AWS. My general recommendations are:
- The cloud moves fast so you need fast-path alerts. This means relying on CloudWatch Events, which native services like Security Hub support far more often than S3 or log streams.
- Create CloudWatch Rules to send events to a Lambda function forwarder.
- That Lambda can send to a Kinesis Stream, Firehose, or wherever you want. Firehose is best for Splunk fans. A Kinesis Data Stream is more useful for StreamAlert fans.
- Use log subscriptions for CloudTrail and VPC Flow Logs, in addition to CloudWatch Events. They are the only way to get flow logs and CloudTrail read events.
Here is sample code to forward CloudWatch Events to a Kinesis Data Stream. I lifted it from the Securosis Advanced Cloud Security training class, where we have students build this out:
```python
import json
import boto3

def lambda_handler(event, context):
    # Forward the raw CloudWatch Event to a Kinesis Data Stream in the
    # central monitoring region. The stream name and region below are
    # from our class environment; change them to match yours.
    kinesis = boto3.client('kinesis', region_name='us-west-2')
    data = json.dumps(event)
    print(data)  # log the event itself for debugging
    response = kinesis.put_record(
        StreamName='cloudsec_prod_stream_alert_kinesis',
        Data=data,
        PartitionKey='test-client-id'
    )
    print(response)
    return {
        'statusCode': 200,
        'body': json.dumps('Record added')
    }
```
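One deployment note on this function: its execution role needs kinesis:PutRecord permission on the destination stream, and the hardcoded stream name and region come from our class environment, so swap in your own values.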
So what’s the deal with CloudTrail?
A few CloudTrail nuances are critical to security pros:
- An Organization trail will pull all activity from all accounts and regions in the organization. Those of you in very large organizations may hit service limits, but for everyone else I recommend turning this on and following the advice in the next bullet…
- For individual accounts, or if you aren’t in an AWS Organization, turn on the trail for all regions.
- This is your main trail for collecting all read and write activity. Make sure you select the option to collect all management activity (at least to Streams, even if not to Events).
- Send this trail to CloudWatch Logs and configure a CloudWatch Log Subscription to forward the logs to your destination monitoring solution (e.g., Splunk or StreamAlert).
- Then turn on a CloudTrail in every region you care about!! This part sucks. This is the only way to get CloudWatch Events for each region: even with an Organization or all-accounts trail, Events still won't appear in the regions where the API calls originate. These Events are essential for fast-path monitoring, but without an actual trail configured in each region you use, you will still face the 5–20 minute CloudTrail delay.
- S3 is always a better long-term log repository, and an easy way to centralize logs across accounts and regions. If you haven’t played with AWS Athena, take a look — you can build cool dashboards right on top of S3. Just never forget this is the slow path.
- Once you set a Trail in a region, it will create CloudWatch Events which you can forward to your monitoring solution using a CloudWatch Rule. However, you need to know the secret rule pattern. The documentation you read might be wrong, but here is the pattern you need:
```json
{
  "detail-type": [
    "AWS API Call via CloudTrail"
  ]
}
```
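To wire that pattern to the forwarding Lambda from earlier, a sketch like the following works in each region. The rule and function names are placeholders, and you must also grant CloudWatch Events permission to invoke the function, as the last call shows.

```python
import json
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Match all CloudTrail-recorded API calls in this region (fast path).
events.put_rule(
    Name='cloudtrail-api-calls',                    # hypothetical rule name
    EventPattern=json.dumps({
        'detail-type': ['AWS API Call via CloudTrail']
    }),
    State='ENABLED'
)

# Target the forwarding Lambda function.
events.put_targets(
    Rule='cloudtrail-api-calls',
    Targets=[{
        'Id': 'forwarder',
        'Arn': 'arn:aws:lambda:us-west-2:111122223333:function:event-forwarder'
    }]
)

# Allow CloudWatch Events to invoke the function.
lambda_client.add_permission(
    FunctionName='event-forwarder',                 # hypothetical function
    StatementId='allow-cloudwatch-events',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn='arn:aws:events:us-west-2:111122223333:rule/cloudtrail-api-calls'
)
```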
Putting it all together
The diagram above shows what you need to do in each region of each account to collect activity. The primary alternative option adds the concept of an EventBus to pull CloudWatch Events from multiple accounts into a single account where the forwarding Lambda functions live. I’m souring on that approach, because you still need to build CloudWatch Rules in each region of each account to send to the EventBus, so it barely saves any effort.
How do I manage this at scale?
First, build the forwarding architecture into standard CloudFormation or Terraform templates for every account you provision. Once you get past a handful of accounts you really need to automate provisioning of baseline services when you create new accounts.
Second, use DisruptOps… or build your own automation to ensure all the wiring stays in place and isn’t broken by local administrators or attackers. CloudFormation and Terraform are excellent for initial provisioning (we use them) but even with drift detection, they aren’t always great at fixing broken things.
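If you do build your own, the core of such a checker is simple. Here is a hedged sketch that verifies the forwarding rule exists, and is enabled, in every region; the rule name matches the hypothetical one used above.

```python
import boto3

# Verify the forwarding rule exists (and is enabled) in every region.
session = boto3.session.Session()
for region in session.get_available_regions('events'):
    events = session.client('events', region_name=region)
    try:
        rules = events.list_rules(NamePrefix='cloudtrail-api-calls')['Rules']
    except Exception as e:                          # e.g., region not enabled
        print(f'{region}: could not check ({e})')
        continue
    if not any(r['State'] == 'ENABLED' for r in rules):
        print(f'{region}: forwarding rule missing or disabled')
```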