How can I use partition projection to create CloudTrail tables for Athena queries?

5 minute read
1

When I use Amazon Athena to query my AWS CloudTrail data, the queries take a long time to run or they time out.

Resolution

CloudTrail logs can grow in size over time even if you partition the CloudTrail table to reduce the runtime of the queries. Queries against a highly partitioned table have a high plan time and don't complete quickly.

To resolve the timeout issue, use partition projection to manually create a CloudTrail table. This allows Athena to dynamically calculate the value of CloudTrail tables, which reduces query runtime. With partition projection, you don't need to manage partitions because partition values and locations are calculated from the configuration.

To create a CloudTrail table partitioned by timestamp with partition projection, see Creating the table for CloudTrail logs in Athena using partition projection.

To create a CloudTrail table from multiple accounts that's partitioned by year, month, and day with partition projection, use a command similar to the following:

CREATE EXTERNAL TABLE ctrail_pp_ymd (eventversion STRING,
useridentity STRUCT<
               type:STRING,
               principalid:STRING,
               arn:STRING,
               accountid:STRING,
               invokedby:STRING,
               accesskeyid:STRING,
               userName:STRING,
sessioncontext:STRUCT<
attributes:STRUCT<
               mfaauthenticated:STRING,
               creationdate:STRING>,
sessionissuer:STRUCT<
               type:STRING,
               principalId:STRING,
               arn:STRING,
               accountId:STRING,
               userName:STRING>>>,
eventtime STRING,
eventsource STRING,
eventname STRING,
awsregion STRING,
sourceipaddress STRING,
useragent STRING,
errorcode STRING,
errormessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
               ARN:STRING,
               accountId:STRING,
               type:STRING>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING,
vpcendpointid STRING
)
PARTITIONED BY (account string, region string, year string, month string, day string)
ROW FORMAT SERDE
  'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT
  'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://doc_example_bucket/AWSLogs/'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.day.type'='integer',
  'projection.day.range'='01,31',
  'projection.day.digits'='2',
  'projection.month.type'='integer',
  'projection.month.range'='01,12',
  'projection.month.digits'='2',
  'projection.region.type'='enum',
  'projection.region.values'='us-east-1,us-east-2,us-west-2',
  'projection.year.type'='integer',
  'projection.year.range'='2015,2021',
  'projection.account.type'='enum', 
  'projection.account.values'='111122223334444,5555666677778888',  
  'storage.location.template'='s3://doc_example_bucket/AWSLogs/${account}/CloudTrail/${region}/${year}/${month}/${day}'
)

Be sure to replace the following variables in the preceding query:

  • ctrail_pp_ymd with the name of the CloudTrail table.
  • doc_example_bucket with the name of the Amazon Simple Storage Service (Amazon S3) bucket where you want to create the CloudTrail table.
  • 1111222233334444 and 5555666677778888 with the account IDs of accounts that you want to create the CloudTrail table for.
  • us-east-1,us-east-2,us-west-2 with the AWS Region that you want to create the CloudTrail table for.
  • Table attributes and properties based on your use case.
  • Projection ranges based on your use case. For example, if your CloudTrail data is available only from year 2018, then replace the projection range for partition column year with '2018,2021'.

To create a CloudTrail table for multiple accounts under the same organization, use a command similar to the following:

CREATE EXTERNAL TABLE ctrail_pp_ymd_org (eventversion STRING,
useridentity STRUCT<
               type:STRING,
               principalid:STRING,
               arn:STRING,
               accountid:STRING,
               invokedby:STRING,
               accesskeyid:STRING,
               userName:STRING,
sessioncontext:STRUCT<
attributes:STRUCT<
               mfaauthenticated:STRING,
               creationdate:STRING>,
sessionissuer:STRUCT<
               type:STRING,
               principalId:STRING,
               arn:STRING,
               accountId:STRING,
               userName:STRING>>>,
eventtime STRING,
eventsource STRING,
eventname STRING,
awsregion STRING,
sourceipaddress STRING,
useragent STRING,
errorcode STRING,
errormessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
               ARN:STRING,
               accountId:STRING,
               type:STRING>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING,
vpcendpointid STRING
)
PARTITIONED BY (account string, region string, year string, month string, day string)
ROW FORMAT SERDE
  'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT
  'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://doc_example_bucket/AWSLogs/doc_example_orgID/'
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.day.type'='integer',
  'projection.day.range'='01,31',
  'projection.day.digits'='2',
  'projection.month.type'='integer',
  'projection.month.range'='01,12',
  'projection.month.digits'='2',
  'projection.region.type'='enum',
  'projection.region.values'='us-east-1,us-east-2,us-west-2',
  'projection.year.type'='integer',
  'projection.year.range'='2010,2100',
  'projection.account.type'='enum', 
  'projection.account.values'='111122223334444,5555666677778888',  
  'storage.location.template'='s3://doc_example_bucket/AWSLogs/doc_example_orgID/${account}/CloudTrail/${region}/${year}/${month}/${day}'
)

Note: If you need to query the CloudTrail data before year 2010, then be sure to update the year range in the projection.year.range property.

Be sure to replace the following variables in the preceding query:

  • ctrail_pp_ymd_org with the name of the CloudTrail table.
  • doc_example_bucket with the name of the Amazon S3 bucket where you want to create the CloudTrail table.
  • doc_example_orgID with your organization ID.
  • 1111222233334444 and 5555666677778888 with the account IDs of accounts that you want to create the CloudTrail table for
  • us-east-1, us-east-2, and us-west-2 with the Regions where you want to create the CloudTrail table.
  • Table attributes and properties based on your use case.
  • Projection ranges based on your use case. For example, if your CloudTrail data is available only from year 2018, then replace the projection range for partition column year with '2018,2021'.

When you run your queries, be sure to include restrictive conditions on the partition columns in your queries. This allows Athena to scan less data and decreases query process time.

For example, you can run a command similar to the following to find out which user made the GetObject request to the S3 bucket. The table in this query uses the year, month, and day partition format.

Note: Be sure to have event logging activated for Amazon S3 in CloudTrail.

SELECT useridentity.arn, eventtime  FROM "ctrail_pp_ymd"where eventname = 'GetObject'
and year = '2021'
and month = '05'
and region = 'us-east-1'
and cast(json_extract(requestparameters, '$.bucketName')as varchar) ='doc_example_bucket'

Be sure to replace the following variables in the preceding query:

  • ctrail_pp_ymd with the name of the CloudTrail table.
  • doc_example_bucket with the name of the S3 bucket where you want to create the CloudTrail table.
  • Restrictive conditions based on your use case.

If you receive timeout issues, then see How do I troubleshoot timeout issues when I query CloudTrail data using Athena?

For more information, see Querying AWS CloudTrail logs and How can I create and use partitioned tables in Athena?

Related information

How do I resolve the partition projection error "INVALID_TABLE_PROPERTY" when I query a table in Athena?

Troubleshooting in Athena

AWS OFFICIAL
AWS OFFICIALUpdated 2 months ago