Amazon SageMaker Lakehouse FAQs

General

Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data with all Apache Iceberg–compatible tools and engines. Secure your data in the lakehouse by defining permissions, which are enforced across all analytics and machine learning (ML) tools and engines. Bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. Additionally, access and query data in-place with federated query capabilities across third-party data sources.

SageMaker Lakehouse:

a) Reduces data silos by providing unified access to your data across Amazon S3 data lakes and Amazon Redshift data warehouses. Data from operational databases and applications can be ingested into your lakehouse in near real time for analytics and ML with no-code or low-code extract, transform, and load (ETL) pipelines. You can also use hundreds of connectors and 13 federated query capabilities to access data from AWS and sources outside of AWS.

b) Gives you the flexibility to access and query all your data in-place, from a wide range of AWS services and open source and third-party tools and engines compatible with Apache Iceberg. You can use analytic tools and engines of your choice such as SQL, Apache Spark, business intelligence (BI), and AI/ML tools, and collaborate with a single copy of data stored across Amazon S3 or Amazon Redshift.

c) Improves enterprise security with a built-in access control mechanism that secures your data when accessed from integrated AWS services, such as Amazon Redshift, Amazon Athena, or Amazon EMR, or third-party Apache Iceberg–compatible engines.

SageMaker Lakehouse is directly accessible from Amazon SageMaker Unified Studio (preview). Data from different sources is organized in logical containers called catalogs in SageMaker Lakehouse. Each catalog represents data from an existing data source, such as an Amazon Redshift data warehouse, a data lake, or a database. New catalogs can be created directly in the lakehouse to store data in Amazon S3 or Amazon Redshift Managed Storage (RMS). Data in SageMaker Lakehouse can be accessed from Apache Iceberg–compatible engines such as Apache Spark, Athena, or Amazon EMR. Additionally, these catalogs can be discovered as databases in Amazon Redshift data warehouses, allowing you to use your SQL tools to analyze your lakehouse data.

Capabilities

SageMaker Lakehouse unifies access control to your data with two capabilities: 1) SageMaker Lakehouse allows you to define fine-grained permissions. These permissions get enforced by query engines such as Amazon EMR, Athena, and Amazon Redshift. 2) SageMaker Lakehouse allows you to get in-place access to your data, removing the need for making data copies. You can maintain a single copy of data and a single set of access control policies to benefit from unified fine-grained access control in SageMaker Lakehouse.

SageMaker Lakehouse is built on multiple technical catalogs across AWS Glue Data Catalog, Lake Formation, and Amazon Redshift to provide unified data access across data lakes and data warehouses. SageMaker Lakehouse uses AWS Glue Data Catalog and Lake Formation to store table definitions and permissions. Lake Formation fine-grained permissions are available to tables defined in SageMaker Lakehouse. You can manage your table definitions in AWS Glue Data Catalog and define fine-grained permissions, such as table-level, column-level, and cell-level permissions, to secure your data. In addition, using the cross-account data-sharing capabilities, you can enable zero-copy data sharing to make data available for secure collaboration.
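As an illustration of a column-level permission, the following is a minimal sketch that builds the request for the Lake Formation GrantPermissions API as exposed by boto3. The role ARN, database, table, and column names are hypothetical placeholders, and the sketch only constructs the request; the call that would apply it is shown as a comment.

```python
def column_grant_request(principal_arn, database, table, columns):
    """Build a Lake Formation request granting SELECT on specific columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical principal and table names, used only for illustration.
request = column_grant_request(
    "arn:aws:iam::111122223333:role/AnalystRole",
    "sales",
    "orders",
    ["order_id", "order_date"],
)
print(request)

# To apply the grant (requires AWS credentials and Lake Formation admin rights):
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)
```

Keeping the request as plain data makes it easy to review or log a grant before it is applied.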

Yes. The open source Apache Iceberg client library is required to access SageMaker Lakehouse. Customers using third-party or self-managed open source engines such as Apache Spark or Trino need to include the Apache Iceberg client library in their query engines to access SageMaker Lakehouse.
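For a self-managed Apache Spark engine, including the Iceberg client library usually means adding the Iceberg Spark runtime to the classpath and registering a catalog through Spark properties. The sketch below assembles those properties for an Iceberg REST catalog with SigV4 request signing. The catalog name, Region, account ID, and the endpoint URI pattern are assumptions for illustration; confirm the exact values against the developer guide before using them.

```python
def iceberg_rest_catalog_conf(catalog_name, region, account_id):
    """Build Spark properties for an Iceberg REST catalog signed with SigV4."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg SQL extensions (for example, MERGE INTO and CALL).
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        # Assumed endpoint pattern; verify for your Region.
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        # Assumed warehouse identifier: the AWS account ID of the catalog.
        f"{prefix}.warehouse": account_id,
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
    }


# Hypothetical values, used only for illustration.
conf = iceberg_rest_catalog_conf("lakehouse", "us-east-1", "111122223333")
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

The printed lines can be passed to spark-submit, or the dictionary can be applied to a SparkSession builder one key at a time.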

Yes. Using an Apache Iceberg client library, you can read and write data to your existing Amazon Redshift data warehouse from Apache Spark engines on AWS services such as Amazon EMR, AWS Glue, Athena, and Amazon SageMaker, or from third-party Apache Spark. However, you must have appropriate write permissions on the tables to write data to them.

Yes, you can join your data lake tables on Amazon S3 with the tables in your Amazon Redshift data warehouse across multiple databases using an engine of your choice, such as Apache Spark.
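A join of this kind can be expressed in Spark SQL with fully qualified catalog.database.table names once both catalogs are registered in the Spark session. The sketch below only builds the query text; the catalog, database, table, and column names are hypothetical, and the actual catalog names depend on how your lakehouse catalogs are configured.

```python
def qualified(catalog, database, table):
    """Return a backtick-quoted three-part table name."""
    return ".".join(f"`{part}`" for part in (catalog, database, table))


def build_join_query(lake_table, warehouse_table, key):
    """Build a Spark SQL inner join between two qualified tables on one key."""
    return (
        f"SELECT l.*, w.* FROM {lake_table} AS l "
        f"JOIN {warehouse_table} AS w ON l.`{key}` = w.`{key}`"
    )


# Hypothetical names: an S3 data lake table and a Redshift warehouse table.
query = build_join_query(
    qualified("s3_lake", "sales", "orders"),
    qualified("redshift_wh", "public", "customers"),
    "customer_id",
)
print(query)

# With a configured SparkSession named `spark`, the query would run as:
#   spark.sql(query).show()
```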

Migration

No, you don't have to migrate your data to use SageMaker Lakehouse. SageMaker Lakehouse allows you to access and query your data in-place, with the open standard of Apache Iceberg. You can directly access your data in Amazon S3 data lakes and Amazon Redshift data warehouses. Data from operational databases and applications can be ingested to the lakehouse in near real time through available zero-ETL integrations, without maintaining infrastructure or complex pipelines. You can also use federated query capabilities to access your in-place data. In addition to these, you can use hundreds of AWS Glue connectors to integrate with your existing data sources.

If you are already an Amazon Redshift user, you can register your Amazon Redshift data warehouse with SageMaker Lakehouse in a few easy steps and without migrating your data. Follow the steps in the developer guide.

If you have configured your Amazon S3 data lake using AWS Glue Data Catalog, you don't need to make any changes.

Zero-ETL integrations

SageMaker Lakehouse supports zero-ETL integrations with Amazon DynamoDB, Amazon Aurora, and Amazon RDS for MySQL, and with eight applications: Zoho CRM, Salesforce, Salesforce Pardot, ServiceNow, Facebook Ads, Instagram Ads, Zendesk, and SAP.

You can configure and monitor your zero-ETL integrations through the AWS Glue console within Amazon SageMaker Data Processing with AWS Glue. Once the data is ingested, you can access and query the data from Apache Iceberg–compatible query engines. For more details, visit Zero-ETL integrations.

To learn more about pricing, visit the SageMaker Lakehouse and AWS Glue pricing pages.

Pricing

Visit SageMaker Lakehouse pricing for details.

Availability

SageMaker Lakehouse is available in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Stockholm), and South America (Sao Paulo).

Yes. SageMaker Lakehouse stores metadata in AWS Glue Data Catalog and offers the same SLA as AWS Glue.

Getting started

To get started, you can log in to your SageMaker domain on SageMaker Unified Studio using your corporate credentials (for example, Okta). In a few short steps in SageMaker Unified Studio, administrators can create projects by choosing a specific project profile. You can then choose a project to work with SageMaker Lakehouse. Once a project is selected, you get a unified view of data, query engines, and developer tools in one place.

Users such as data engineers and data analysts can then query the data by using a tool of their choice. For example, when a data engineer uses a notebook and issues a Spark command to list tables, they discover all data warehouse and data lake tables they have access to. They can then run commands to read and write data to tables that are physically stored in either Amazon S3 data lakes or Amazon Redshift data warehouses. Similarly, when a data analyst runs Amazon Redshift SQL commands from a SQL editor, they get the same unified view of data and can read and write data to these tables. From your preferred tool (SQL editor or notebook), you can create new tables in Amazon S3 or Amazon Redshift. You can also query Amazon Redshift materialized views to accelerate performance on your data lake tables.

In addition to SageMaker Unified Studio, SageMaker Lakehouse is also accessible from the AWS Management Console, AWS Glue APIs, AWS Command Line Interface (AWS CLI), or AWS SDKs. For more details, visit the documentation page.
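As one example of access through the AWS SDKs, the sketch below lists the table names in a database using the AWS Glue GetTables API and its boto3 paginator. The function takes the client as a parameter so that it can be exercised without AWS credentials; the database name in the usage comment is a hypothetical placeholder.

```python
def list_table_names(glue_client, database_name):
    """Return the names of all tables in a Glue Data Catalog database."""
    paginator = glue_client.get_paginator("get_tables")
    names = []
    # GetTables is paginated; each page carries its tables under "TableList".
    for page in paginator.paginate(DatabaseName=database_name):
        names.extend(table["Name"] for table in page.get("TableList", []))
    return names


# Usage (requires AWS credentials; "sales" is a hypothetical database name):
#   import boto3
#   print(list_table_names(boto3.client("glue"), "sales"))
```

Only tables the caller is permitted to see are returned, so the result reflects the permissions defined for the lakehouse.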