AWS Athena with VPC Flow Logs
The goal of this article comes from project requirements. We wanted to understand how much traffic flows between the EKS control plane (master) and worker nodes in AWS in order to understand our AWS costs, specifically the price per 1 GB of traffic between AZs in the same region and between different regions.
Project requirements:
- Get the total IN/OUT traffic per day/month
- Get the total IN/OUT traffic by source/destination IP address
We could get some of this from Grafana dashboards and Container Insights boards, but they were impractical for total IN/OUT numbers. After some testing and research, I found AWS's official and simple approach: VPC Flow Logs. The VPC sends all logs, with a structure we define, to AWS S3, but to extract and use the collected data we must use AWS Athena.
- VPC Flow Logs collect traffic records and deliver them to the AWS S3 bucket at a set aggregation interval
- AWS Athena gives us the ability to query the collected flow log data with SQL syntax
To set this up we need an AWS S3 bucket to store the VPC logs, in the same region where the EKS cluster is deployed (in my case us-east-1; for security reasons, encryption is enabled on this bucket). The Terraform code looks up the VPC ID from the EKS cluster name defined in the eks_cluster_name variable in the vars.tf file, and the ARN of the S3 bucket is used when the VPC Flow Logs are created. After that, a Glue database for Athena is created using the name from the flowlog_gb_glue_name variable in vars.tf. All of this is created by the Terraform code. Just clone this repository and execute the following commands (of course, the access key, secret key, and region variables must be defined).
$ git clone https://github.com/jamalshahverdiev/terraform-aws-athena-vpc-flowlogs.git && cd terraform-aws-athena-vpc-flowlogs
$ terraform init
$ terraform plan && echo yes | terraform apply
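Optionally, before moving on, you can check from the command line that Terraform created everything. The following is a minimal boto3 sketch; the region, VPC ID, and database name below are placeholders/assumptions that you should replace with your own values (the database name corresponds to the flowlog_gb_glue_name variable):
# Sketch: verify that the VPC Flow Log and the Glue database exist.
# Region, VPC ID and database name below are placeholders/assumptions.
import boto3

region = "us-east-1"        # region where the EKS cluster is deployed
vpc_id = "vpc-xxxxxxxx"     # VPC resolved from the eks_cluster_name variable
glue_db = "flowloggluedb"   # value of the flowlog_gb_glue_name variable

ec2 = boto3.client("ec2", region_name=region)
for fl in ec2.describe_flow_logs(
        Filters=[{"Name": "resource-id", "Values": [vpc_id]}])["FlowLogs"]:
    print(fl["FlowLogId"], fl["TrafficType"], fl.get("LogDestination"), fl.get("LogFormat"))

glue = boto3.client("glue", region_name=region)
print(glue.get_database(Name=glue_db)["Database"]["Name"])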
When the Terraform code creates the VPC Flow Logs, it uses the ARN of the newly created S3 bucket, captures ALL traffic types, sets a 10-minute aggregation interval (we will partition the logs every 24 hours), and sets the log record format to Custom, as follows:
${version} ${account-id} ${instance-id} ${interface-id} ${pkt-srcaddr} ${srcaddr} ${srcport} ${pkt-dstaddr} ${dstaddr} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status} ${az-id}
After some time, go to the S3 bucket and look at the folder structure created there:
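The same layout can also be listed without the console; here is a short boto3 sketch (the bucket name and account ID are placeholders):
# Sketch: list the year prefixes that VPC Flow Logs create under the bucket.
# Replace the bucket name and account ID with your own values.
import boto3

s3 = boto3.client("s3")
prefix = "AWSLogs/write-your-aws-account-id/vpcflowlogs/us-east-1/"
resp = s3.list_objects_v2(Bucket="s3-bucket-name", Prefix=prefix, Delimiter="/")
for item in resp.get("CommonPrefixes", []):
    print(item["Prefix"])   # e.g. AWSLogs/<account-id>/vpcflowlogs/us-east-1/2021/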
Then open the Athena service in AWS (Query Data in S3 using SQL) and define the location where all query results will be saved. In my case it is: s3://s3-bucket-name/jamal.shahverdiyev
Create a table for the log structure inside the database we defined in the flowlog_gb_glue_name variable of the vars.tf file. The fields must match the ones we defined when we created the flow logs. Make sure to use the correct S3 bucket name and your own AWS account ID:
CREATE EXTERNAL TABLE IF NOT EXISTS flowloggluedb.vpc_flow_logs (
version int,
account string,
instanceid string,
interfaceid string,
pktsrcaddr string,
sourceaddress string,
sourceport int,
pktdstaddr string,
destinationaddress string,
destinationport int,
protocol int,
numpackets int,
numbytes bigint,
starttime int,
endtime int,
action string,
logstatus string,
azid string
) PARTITIONED BY (
dt string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LOCATION 's3://s3-bucket-name/AWSLogs/write-your-aws-account-id/vpcflowlogs/us-east-1/' TBLPROPERTIES ("skip.header.line.count"="1");
Create a partition for quick searching (don't forget to use the right bucket name and account ID):
ALTER TABLE flowloggluedb.vpc_flow_logs
ADD PARTITION (dt='2021-12-01')
location 's3://s3-bucket-name/AWSLogs/write-your-aws-account-id/vpcflowlogs/us-east-1/2021/12/01';
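Adding one partition per day by hand quickly becomes tedious. Below is a minimal boto3 sketch that submits one ALTER TABLE statement per day for a whole month through Athena; the bucket name, account ID, date range, and query result location are placeholders:
# Sketch: add a dt=YYYY-MM-DD partition for every day of December 2021.
# Bucket name, account ID and result location below are placeholders.
import boto3
from datetime import date, timedelta

athena = boto3.client("athena", region_name="us-east-1")
bucket = "s3-bucket-name"
account_id = "write-your-aws-account-id"
results = "s3://s3-bucket-name/jamal.shahverdiyev/"   # Athena query result location

day = date(2021, 12, 1)
while day <= date(2021, 12, 31):
    location = (f"s3://{bucket}/AWSLogs/{account_id}/vpcflowlogs/us-east-1/"
                f"{day:%Y/%m/%d}")
    statement = (f"ALTER TABLE flowloggluedb.vpc_flow_logs "
                 f"ADD IF NOT EXISTS PARTITION (dt='{day}') LOCATION '{location}'")
    athena.start_query_execution(
        QueryString=statement,
        ResultConfiguration={"OutputLocation": results},
    )
    day += timedelta(days=1)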
Send an SQL query to get the total traffic in megabytes from the EKS worker nodes:
SELECT sourceaddress AS SOURCE_ADDRESS, SUM(numbytes)/1024/1024 AS TOTAL_IN_MB FROM flowloggluedb.vpc_flow_logs
WHERE action = 'ACCEPT' AND sourceaddress in (
'10.10.10.151', '10.10.10.174', '10.10.10.229', '10.10.10.40') GROUP BY sourceaddress;
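The remaining requirements come from the same table: summing numbytes by destinationaddress gives the traffic received by those nodes, and grouping by the dt partition gives per-day totals (a date range on dt covers a whole month). Here is a minimal boto3 sketch that runs such a query and prints the result rows; the IP addresses, date range, and result location are placeholders:
# Sketch: per-day traffic totals for a set of destination addresses.
# IPs, dates, bucket and result location below are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")
query = """
SELECT dt, destinationaddress, SUM(numbytes)/1024/1024 AS total_mb
FROM flowloggluedb.vpc_flow_logs
WHERE action = 'ACCEPT'
  AND dt BETWEEN '2021-12-01' AND '2021-12-31'
  AND destinationaddress IN ('10.10.10.151', '10.10.10.174')
GROUP BY dt, destinationaddress
ORDER BY dt
"""
qid = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "flowloggluedb"},
    ResultConfiguration={"OutputLocation": "s3://s3-bucket-name/jamal.shahverdiyev/"},
)["QueryExecutionId"]

# Wait for the query to finish, then print every row (the first row is the header).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])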
After the queries run, the results are saved in the S3 bucket, as in the following screenshot:
In this way, we can formulate any kind of SQL query to get traffic information between services, pods, and nodes (source/destination). To send SQL queries in a loop, we can use the AWS CLI or Boto3 (example in this script, and in the sketches above). I hope this will be useful to someone.