
Mastering AWS Athena: The Complete Guide for Data Analysts


As a data analyst, I regularly need to analyze large datasets in S3 across various projects. But setting up and managing servers just to query data is a pain. AWS Athena has become an indispensable tool in my toolbox for its serverless approach to querying S3 data using standard SQL.

In this comprehensive 3200+ word guide, I'll share what I've learned as a practicing data analyst about mastering AWS Athena, so you can stop worrying about infrastructure and focus on uncovering insights from your data.

What is Athena and When Should You Use It?

Athena lets you analyze data in S3 on-demand using SQL without servers or infrastructure. Behind the scenes, it uses Presto as the distributed SQL query engine to execute queries in parallel across potentially thousands of serverless compute resources.

It works seamlessly with the AWS Glue Data Catalog for schema and metadata management. Glue crawlers can automatically scan your data and infer schemas, so you don't always have to define them manually.

As a serverless service, Athena auto-scales query execution based on data size and complexity. You don’t have to provision instances or manage servers.

Athena is pay-per-query – you only pay for the queries executed based on volume of data scanned. There is no charge for DDL operations like CREATE/DROP tables.

Based on my experience, I recommend using Athena when:

  • You need to analyze datasets in S3 using ad-hoc, interactive SQL queries.
  • Your queries are exploratory in nature for data discovery and profiling.
  • You want to run quick one-off queries on logs or other datasets without standing up servers.
  • You have a dynamic workload with unpredictable query patterns.
  • You want to operationalize dashboards and BI tools on S3 data cost-effectively.

Athena handles the heavy lifting of executing SQL at scale against S3 data so you can focus on your analysis rather than managing infrastructure.

Next, let's do a quick hands-on tutorial of Athena in action.

Getting Started with Athena – A Practical Tutorial

I've uploaded a sample employee.csv file to an S3 bucket athena-demo-data with some fake employee data. Let's create a table mapped to this data and run some queries.
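The actual file contents aren't reproduced here, but a CSV matching the table we're about to define would look something like this (hypothetical rows, comma-delimited, no header):

1,Alice,Engineering,95000
2,Bob,Marketing,72000
3,Carol,Engineering,101000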

Step 1 – Create the Table Mapped to S3 Data

Go to the Athena console and run this DDL:

CREATE EXTERNAL TABLE employees (
  `id` int,
  `name` string, 
  `dept` string, 
  `salary` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://athena-demo-data/employees/';

This creates an external table named employees mapped to the CSV data in S3. Here the columns and data types come from the DDL itself, and the table definition is stored in the Glue Data Catalog (alternatively, a Glue crawler could infer the schema for you).

Step 2 – Execute Queries

Now I can execute SQL queries against this table:

-- Count employees
SELECT COUNT(*) FROM employees; 

-- Get average salary
SELECT AVG(salary) FROM employees;

-- Find top 5 salaries
SELECT * FROM employees
ORDER BY salary DESC LIMIT 5;

Athena executes these interactive queries in seconds against the S3 data. Pretty cool!

Step 3 – Visualize Results

I can easily visualize Athena query results in QuickSight reports and dashboards, since QuickSight connects to Athena as a data source (and Athena also saves query output to S3). For example, here is a simple bar chart built in QuickSight showing average salary by department:

[Figure: QuickSight dashboard showing average salary by department]

Now that you've got a hands-on feel, let's dive deeper into features, use cases, performance tuning, and limitations.

Key Features and Functionality

Here are some of the key features and capabilities that make Athena so useful for data analysis:

Standard SQL Support

Athena allows querying data with standard SQL syntax. It also extends SQL with useful Presto functions. This makes adoption easy if you know basic SQL.
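As a quick illustration, here is the kind of Presto-style function you can mix into plain SQL. This is a minimal sketch assuming a hypothetical log_table with an ISO-8601 event_time string and a user_id column:

-- Approximate distinct users per day using Presto functions
SELECT date_trunc('day', from_iso8601_timestamp(event_time)) AS day,
       approx_distinct(user_id) AS approx_users
FROM log_table
GROUP BY 1
ORDER BY 1;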

Serverless at Scale

The serverless architecture auto-scales query execution based on data size and complexity. You can query petabytes of data without managing infrastructure.

Support for Common Data Formats

Process data in open formats like CSV, JSON, Parquet, ORC, and Avro, among others. Compression codecs like Snappy, Zlib, and Gzip are supported.
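For example, pointing a table at Parquet data is just a matter of declaring the format in the DDL. This is a minimal sketch with hypothetical bucket and column names, partitioned by date:

CREATE EXTERNAL TABLE events_parquet (
  event_id string,
  user_id string,
  event_type string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://athena-demo-data/events-parquet/';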

Integration with Glue Catalog

Leverages Glue crawlers to automatically infer schema and populate metadata for data in S3. Provides information about partitions and layout.
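If partitions land in S3 outside of a crawler run, you can also register them from Athena itself. A minimal sketch, assuming the hypothetical events_parquet table above uses Hive-style prefixes like dt=2023-01-15:

-- Discover Hive-style partitions in S3 and add them to the Glue catalog
MSCK REPAIR TABLE events_parquet;

-- Inspect the partitions the catalog knows about
SHOW PARTITIONS events_parquet;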

Cost-effective Pricing

Pay only for the queries you run based on volume of data scanned. Athena costs around $5 per TB as of early 2023. DDL statements are free.

Fast Performance

Athena uses columnar reads, lazy evaluation, and other engine optimizations, enabling most interactive queries to return in seconds. DDL commands also execute quickly.

Security

Supports IAM access control, encryption of data at rest and in transit, and VPC endpoints so you can access Athena securely from your private network.

Query Federation

With federated queries, Athena can reach beyond S3 to JDBC sources like RDS and Redshift, as well as other data stores, via Lambda-based connectors. At the time of writing, support for additional sources such as EMR and MSK was listed as coming soon.

These capabilities make Athena a versatile tool for data analysis scenarios. Next, let's look at how it can supercharge real-world use cases.

Common Use Cases and Applications

Based on my experience, here are some of the most common ways I leverage Athena for data analysis and processing:

Ad-hoc Exploratory Queries

Perform one-off queries for data discovery – understand what data you have, profile datasets, find relationships and outliers.

-- Check distribution of values
SELECT col1, COUNT(*) c 
FROM my_table
GROUP BY col1
ORDER BY c DESC;

Log Analytics

Analyze web server logs, clickstream data, ad impression logs, etc. for usage trends and funnel analysis.

-- Find 404 errors 
SELECT timestamp, request_url, http_status
FROM log_table 
WHERE http_status = '404';

Business Intelligence

Connect BI tools like QuickSight to Athena to build interactive dashboards and reports for business users.

-- Revenue by segment
SELECT segment, SUM(revenue) rev
FROM sales_table
GROUP BY segment;

Machine Learning

Transform data and generate features for model training using Athena SQL queries instead of custom code.

-- Generate features
SELECT u.userid, AVG(u.salary) avg_sal, COUNT(c.sessions) num_sessions
FROM users u
JOIN clicks c ON u.userid = c.userid
GROUP BY u.userid;

ETL on S3 Data

Use Athena for ETL-like workflows – filter, transform, join, aggregate data sets before further processing.

CREATE TABLE filtered_users WITH (format = 'Parquet') AS
SELECT * FROM users
WHERE age > 25 AND country = 'USA';

Athena makes these data analytics workflows easier without moving data out of S3 into specialized warehousing systems.

But to get optimal performance and lower costs, you need to tune Athena properly. Let's go over some key best practices.

Performance Optimization Tips and Tricks

Based on extensive usage, I've compiled some handy tips to optimize Athena performance and lower your costs:

  • Use columnar formats like Parquet – Improves scan efficiency due to column-oriented storage.

  • Enable compression – Snappy or Zlib compression reduces data scanned. But balance with compute overhead.

  • Partition tables – Avoid scanning the entire dataset by filtering on partition columns (see the sketch after this list).

  • Manage file size – Aim for files roughly in the 128 MB–1 GB range; lots of tiny files hurt performance because of per-file overhead.

  • Use separate buckets – Store tables in different buckets to allow parallel scan.

  • Collect statistics – Where supported, gather table and column statistics so the engine can better optimize query plans.

  • Tune split size – Adjust max split size based on data characteristics to improve parallelism.

  • Create Views – For common expressions or joins, create a View rather than repeating logic.

  • Use CTAS – For iterative queries, use CTAS to write result to new table rather than rescanning source repeatedly.

  • Monitor queries – Identify expensive and slow queries using CloudWatch metrics to optimize.

  • Cache hot data – For frequently queried data, materialize an optimized copy (for example, via CTAS into Parquet) or load it into a warehouse like Redshift instead of rescanning raw files each time.
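
To make the partitioning and CTAS tips concrete (as referenced above), here is a minimal sketch that converts a hypothetical raw_events table into partitioned, compressed Parquet and then queries a single partition, so only that slice of data is scanned:

-- One-time conversion: rewrite raw data as partitioned, Snappy-compressed Parquet
CREATE TABLE events_optimized
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  partitioned_by = ARRAY['dt'],
  external_location = 's3://athena-demo-data/events-optimized/'
) AS
SELECT event_id, user_id, event_type, dt
FROM raw_events;

-- Later queries filter on the partition column, scanning only matching prefixes
SELECT event_type, COUNT(*) AS events
FROM events_optimized
WHERE dt = '2023-01-15'
GROUP BY event_type;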

By following these performance best practices, you can optimize Athena to provide fast response times on large datasets and minimize costs.

Let's take a quick look at the pricing model and how to estimate costs.

Athena Pricing and Cost Management

Athena follows serverless pay-per-query pricing. As of Jan 2023, it costs $5 per TB of data scanned by each query. There is no charge for DDL statements.

Unlike some AWS services, Athena does not include a sizable free tier: every query is billed on the data it scans, with a minimum of 10 MB charged per query.
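As a rough worked example at that rate: a query that scans 200 GB of data costs about 0.2 TB × $5 = $1.00. If converting the data to compressed Parquet cut the scan down to, say, 20 GB, the same query would cost roughly $0.10.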

To estimate your Athena costs:

  • Check past usage – The console shows data scanned and cost for each query run historically.

  • Monitor usage – Set up CloudWatch metrics and alarms to monitor usage trends.

  • Estimate frequently used queries – For common queries, calculate expected data scan using table sizes.

  • Use projections – When exploring data, select only the columns you need instead of SELECT * to minimize scanned data (see the example after this list).

  • Manage user access – Use workgroups to restrict access and allocate query costs by department.

  • Right-size resources – Don't over-provision the EMR clusters or Glue jobs that prepare data for Athena.
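
Here is a small illustration of the projection tip referenced above, reusing the hypothetical sales_table from earlier. For columnar formats like Parquet, the second query scans only the two columns it references:

-- Scans every column in the table
SELECT * FROM sales_table;

-- Scans only the columns needed for the analysis
SELECT segment, revenue FROM sales_table;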

By following the best practices outlined earlier and optimizing your most common and expensive queries, you can reduce Athena costs significantly.

Now let's compare Athena to other query solutions on AWS.

Athena vs. Redshift vs. EMR – Comparing AWS Query Services

Redshift, EMR, and Athena can all be used for data processing and analytics on AWS. But they have different strengths based on the use case:

Redshift is ideal for traditional data warehousing workloads. It provides a managed, petabyte-scale MPP analytics database optimized for complex SQL analytics, but it requires provisioning and managing clusters.

EMR provides a managed Hadoop/Spark framework optimized for heavy batch processing workloads and ETL pipelines. Useful for data engineering tasks. Requires cluster setup and job tuning.

Athena shines for ad-hoc, interactive querying directly on S3 data without infrastructure. Pay-per-query pricing model makes it very cost-effective for querying data lakes.

I prefer Athena for its serverless capabilities and pay-per-query pricing. But pick the technology that best matches your specific workload and use case. Athena complements other AWS analytics services nicely.

Key Limitations to be Aware Of

While Athena is extremely useful, it does have some limitations:

  • Row size limit – 1 MB row size limit can cause errors for very wide datasets. Workaround is to split into multiple rows.

  • Limited DML support – INSERT INTO is supported, but UPDATE and DELETE are not available for standard tables (they require transactional table formats such as Apache Iceberg). A common alternative is to write a new table with CTAS.

  • Serverless latency – Being serverless, first query execution can take longer as resources spin up.

  • No stored procedures – You cannot create parameterized stored procedures; use views for reusable logic instead (see the example after this list).

  • Limited data type support – Some data types found in other SQL engines aren't supported or behave differently, so check the documentation before porting schemas.

  • No indexes – You cannot create indexes to speed up queries the way you can in a traditional database; partitioning and columnar formats fill that role instead.
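
As a sketch of the view-based workaround mentioned above (reusing the hypothetical sales_table from earlier):

-- Encapsulate reusable filtering and aggregation logic in a view
CREATE OR REPLACE VIEW high_value_sales AS
SELECT segment, SUM(revenue) AS rev
FROM sales_table
WHERE revenue > 1000
GROUP BY segment;

-- Analysts can then query the view like a table
SELECT * FROM high_value_sales ORDER BY rev DESC;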

The benefits of Athena generally outweigh these limitations for ad-hoc analytical querying. For workloads involving frequent inserts and updates, alternatives like Redshift or a transactional database are better suited.

Key Takeaways and Conclusion

Let me summarize the key takeaways from this comprehensive guide to mastering AWS Athena:

  • Athena enables ad-hoc SQL querying on S3 data without infrastructure overhead
  • Ideal for exploratory analysis, BI dashboards, log analytics use cases
  • Pay only for the queries executed based on data scanned
  • Follow best practices like partitioning and compression to optimize performance and lower costs
  • Limitations include lack of DML support and maximum row sizes
  • Complements Redshift, EMR for other analytical workloads on AWS

I hope this guide has armed you with the right mental models to be able to leverage Athena effectively for unlocking insights from data in S3.

With the knowledge of Athena's capabilities, performance tuning tricks, and limitations, you can make smarter decisions on when to use Athena versus other tools for your analytics use cases on AWS.

Happy querying!


Written by Alexis Kestler

A female web designer and programmer, now a 36-year-old IT professional with over 15 years of experience living in NorCal. I enjoy keeping my feet wet in the world of technology through reading, working, and researching topics that pique my interest.