I'm querying a table in Amazon Athena. I want to find out which Amazon Simple Storage Service (Amazon S3) file is the source for each row in the output, or which rows correspond to a specific file.
Short description
- To find out which Amazon S3 file contains data that is returned by a specific row in an Athena table, run a SELECT $path query.
- To find out which Athena table rows are associated with a specific Amazon S3 file, run a SELECT query with the WHERE $path condition.
These queries are useful for investigating unexpected data and building queries that require information about the source data.
Resolution
Before you run the following queries, be sure that you have the following:
- An Athena table. If you don't have a table, run a CREATE TABLE statement.
- An AWS Identity and Access Management (IAM) user or role that has permissions to run Athena queries.
To find the S3 file that's associated with a row of an Athena table:
1. Run a SELECT query against your table to return the data that you want:
SELECT * FROM "my_database"."my_table" WHERE year=2019;
2. To find the Amazon S3 source file for the data, run a query similar to the following:
SELECT "$path" FROM "my_database"."my_table" WHERE year=2019;
The query returns the Amazon S3 path for the data:
s3://awsexamplebucket/datasets_mytable/year=2019/data_file1.json
To return Athena table rows that originate from a specific Amazon S3 file:
1. Confirm the name and location of the Amazon S3 object that you want to retrieve rows for. If you don't know the object name and location, run the ls command to list objects under a specific S3 bucket or prefix. In the following example, replace s3://awsexamplebucket/my_table/my_partition/ with your partition path.
aws s3 ls s3://awsexamplebucket/my_table/my_partition/
2. In Athena, run a SELECT query against your table to return data that's associated with the Amazon S3 file:
SELECT *,"$path" FROM "my_database"."my_table" WHERE regexp_like("$path", 's3://awsexamplebucket/my_table/my_partition/file-01.csv')
The query returns data that matches the Amazon S3 path:
id name year $path
3 John 1999 's3://awsexamplebucket/my_table/my_partition/file-01.csv'
4 Jane 2000 's3://awsexamplebucket/my_table/my_partition/file-01.csv'
This query also supports some wildcard functionality. For example, to return all data that contains part of a file name, run a query like this:
SELECT *,"$path" FROM "my_database"."my_table" WHERE regexp_like("$path", '\-02')
This query returns all rows with file names that contain -02:
id name year $path
13 Mia 2009 's3://awsexamplebucket/my_table/my_partition/file-02.csv'
14 Mary 2010 's3://awsexamplebucket/my_table/my_partition/file-02.csv'
15 Max 2011 's3://awsexamplebucket/my_table/my_partition/file-02.csv'
16 Merriam 2012 's3://awsexamplebucket/my_table/my_partition/file-02.csv'
Related information
Parquet SerDe
Connecting to data sources