Amazon SageMaker Lakehouse FAQs
General
What is Amazon SageMaker Lakehouse?
Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data with all Apache Iceberg–compatible tools and engines. Secure your data in the lakehouse by defining permissions, which are enforced across all analytics and machine learning (ML) tools and engines. Bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. Additionally, access and query data in-place with federated query capabilities across third-party data sources.
What are the benefits of SageMaker Lakehouse?
SageMaker Lakehouse:
a) Reduces data silos by providing unified access to your data across Amazon S3 data lakes and Amazon Redshift data warehouses. Data from operational databases and applications can be ingested into your lakehouse in near real time for analytics and ML with no-code or low-code extract, transform, and load (ETL) pipelines. You can also use hundreds of connectors and federated query capabilities to access data from AWS and sources outside of AWS.
b) Gives you the flexibility to access and query all your data in-place, from a wide range of AWS services and open source and third-party tools and engines compatible with Apache Iceberg. You can use analytic tools and engines of your choice such as SQL, Apache Spark, business intelligence (BI), and AI/ML tools, and collaborate with a single copy of data stored across Amazon S3 or Amazon Redshift.
c) Improves enterprise security with a built-in access control mechanism that secures your data when accessed from integrated AWS services, such as Amazon Redshift, Amazon Athena, or Amazon EMR, or third-party Apache Iceberg–compatible engines.
How does SageMaker Lakehouse work?
SageMaker Lakehouse is directly accessible from Amazon SageMaker Unified Studio (preview). Data from different sources is organized in logical containers called catalogs. Each catalog represents data from an existing data source, such as an Amazon Redshift data warehouse, a data lake, or a database. New catalogs can also be created directly in the lakehouse to store data in Amazon S3 or Amazon Redshift Managed Storage (RMS). Data in SageMaker Lakehouse can be accessed from Apache Iceberg–compatible engines such as Apache Spark, Athena, or Amazon EMR. Additionally, these catalogs can be discovered as databases in Amazon Redshift data warehouses, allowing you to use your existing SQL tools to analyze your lakehouse data.
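As an illustration, the sketch below shows how a self-managed PySpark session might be wired to an Iceberg catalog backed by AWS Glue Data Catalog. The catalog name lakehouse, the S3 warehouse path, and the library versions are illustrative assumptions, not the definitive setup; consult the documentation for the exact connection properties for your account.

```python
# Minimal sketch (assumptions noted inline): a self-managed PySpark session
# configured to query lakehouse data through the open source Apache Iceberg
# client. Catalog name, S3 path, and versions below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    # Pull in the Iceberg Spark runtime and AWS bundle (versions assumed).
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
        "org.apache.iceberg:iceberg-aws-bundle:1.6.1",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register an Iceberg catalog (the name "lakehouse" is arbitrary).
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config(
        "spark.sql.catalog.lakehouse.catalog-impl",
        "org.apache.iceberg.aws.glue.GlueCatalog",
    )
    .config("spark.sql.catalog.lakehouse.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://your-bucket/warehouse/")
    .getOrCreate()
)

# Catalogs surface as namespaces; tables can then be queried in place.
spark.sql("SHOW NAMESPACES IN lakehouse").show()
spark.sql("SELECT * FROM lakehouse.sales_db.orders LIMIT 10").show()
```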
Capabilities
How does SageMaker Lakehouse deliver unified access control to data?
SageMaker Lakehouse unifies access control to your data with two capabilities: 1) it allows you to define fine-grained permissions, which are enforced by query engines such as Amazon EMR, Athena, and Amazon Redshift; and 2) it gives you in-place access to your data, removing the need to make data copies. You can maintain a single copy of data and a single set of access control policies to benefit from unified fine-grained access control in SageMaker Lakehouse.
How does SageMaker Lakehouse work with existing AWS services such as AWS Glue Data Catalog, AWS Lake Formation, and Amazon Redshift?
SageMaker Lakehouse is built on multiple technical catalogs across AWS Glue Data Catalog, Lake Formation, and Amazon Redshift to provide unified data access across data lakes and data warehouses. SageMaker Lakehouse uses AWS Glue Data Catalog and Lake Formation to store table definitions and permissions. Lake Formation fine-grained permissions are available for tables defined in SageMaker Lakehouse. You can manage your table definitions in AWS Glue Data Catalog and define fine-grained permissions, such as table-level, column-level, and cell-level permissions, to secure your data. In addition, using the cross-account data-sharing capabilities, you can enable zero-copy data sharing to make data available for secure collaboration.
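For illustration, here is a hedged sketch of granting table-level and column-level permissions with the Lake Formation API through boto3. The role ARNs, database, table, and column names are placeholders.

```python
# Sketch: granting fine-grained Lake Formation permissions with boto3.
# The principal ARNs, database, table, and column names are placeholders.
import boto3

lf = boto3.client("lakeformation")

# Table-level permission: allow the analyst role to SELECT the whole table.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)

# Column-level permission: restrict another role to two non-sensitive columns.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/PartnerRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date"],
        }
    },
    Permissions=["SELECT"],
)
```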
Do I need any client software to access Apache Iceberg APIs provided by SageMaker Lakehouse?
Yes. The open source Apache Iceberg client library is required to access SageMaker Lakehouse. Customers using third-party or self-managed open source engines such as Apache Spark or Trino need to include the Apache Iceberg client library in their query engines to access SageMaker Lakehouse.
Can I use SageMaker Lakehouse to write data to my Amazon Redshift data warehouse using Apache Spark?
Yes. Using an Apache Iceberg client library, you can read and write data to your existing Amazon Redshift data warehouse from Apache Spark engines on AWS services such as Amazon EMR, AWS Glue, Athena, and Amazon SageMaker, or from third-party Apache Spark engines. However, you must have the appropriate write permissions on a table to write data to it.
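Assuming a Spark session configured as in the earlier sketch, a write could look like the following. The catalog, namespace, and table names are placeholders, and the call succeeds only if you hold write permissions on the target table.

```python
# Sketch: writing to a table whose storage is Amazon Redshift Managed
# Storage, reusing the Iceberg-configured Spark session from the earlier
# example. Catalog, namespace, and table names are placeholders.
df = spark.sql("SELECT * FROM lakehouse.staging_db.new_orders")

# Append rows; this requires write permissions on the target table.
df.writeTo("lakehouse.redshift_db.orders").append()
```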
Can I join my data lake and Amazon Redshift data warehouse tables on SageMaker Lakehouse?
Yes, you can join your data lake tables on Amazon S3 with the tables in your Amazon Redshift data warehouse across multiple databases using an engine of your choice, such as Apache Spark.
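As a hedged example, a single Spark SQL statement can reference both storage locations. The catalog, database, and table names below are illustrative.

```python
# Sketch: joining a data lake table (stored in Amazon S3) with a data
# warehouse table (stored in Amazon Redshift) in one Spark SQL query.
# Catalog, database, and table names are illustrative.
joined = spark.sql("""
    SELECT c.customer_name, SUM(o.amount) AS total_spend
    FROM lakehouse.lake_db.orders o            -- Amazon S3 data lake table
    JOIN lakehouse.redshift_db.customers c     -- Amazon Redshift table
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_name
""")
joined.show()
```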
Migration
Do I need to migrate my data to use SageMaker Lakehouse?
No, you don't have to migrate your data to use SageMaker Lakehouse. SageMaker Lakehouse allows you to access and query your data in place through the open Apache Iceberg standard. You can directly access your data in Amazon S3 data lakes and Amazon Redshift data warehouses. Data from operational databases and applications can be ingested into the lakehouse in near real time through available zero-ETL integrations, without maintaining infrastructure or complex pipelines. You can also use federated query capabilities to access your data in place. In addition, you can use hundreds of AWS Glue connectors to integrate with your existing data sources.
I currently use Amazon Redshift. How can I bring my Amazon Redshift data warehouse to SageMaker Lakehouse?
If you are already an Amazon Redshift user, you can register your Amazon Redshift data warehouse with SageMaker Lakehouse in a few easy steps and without migrating your data. Follow the steps in the developer guide.
I currently use an Amazon S3 data lake. How can I bring my data lake to SageMaker Lakehouse?
If you have configured your Amazon S3 data lake using AWS Glue Data Catalog, you don't need to make any changes: your existing AWS Glue Data Catalog databases and tables can be accessed directly from SageMaker Lakehouse.
Zero-ETL integrations
What are the different zero-ETL integrations available with SageMaker Lakehouse?
SageMaker Lakehouse supports zero-ETL integrations with Amazon DynamoDB, Amazon Aurora, and Amazon RDS for MySQL, as well as eight applications: Zoho CRM, Salesforce, Salesforce Pardot, ServiceNow, Facebook Ads, Instagram Ads, Zendesk, and SAP.
How do I access zero-ETL integrations with SageMaker Lakehouse?
You can configure and monitor your zero-ETL integrations through the AWS Glue console within Amazon SageMaker Data Processing with AWS Glue. Once the data is ingested, you can access and query the data from Apache Iceberg–compatible query engines. For more details, visit Zero-ETL integrations.
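For example, once a zero-ETL integration has landed data in the lakehouse, you might query it with Athena through boto3 as sketched below. The database, table, and S3 results location are placeholders.

```python
# Sketch: querying data ingested through a zero-ETL integration with
# Amazon Athena via boto3. The database, table, and S3 results location
# are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM salesforce_accounts LIMIT 10",
    QueryExecutionContext={"Database": "zero_etl_db"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```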
What is the pricing model for zero-ETL?
To learn more about pricing, visit the SageMaker Lakehouse and AWS Glue pricing pages.
Pricing
What is the pricing for SageMaker Lakehouse?
Visit SageMaker Lakehouse pricing for details.
Availability
In which AWS Regions is SageMaker Lakehouse available?
SageMaker Lakehouse is available in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Stockholm), and South America (São Paulo).
Does SageMaker Lakehouse offer an SLA?
Yes. SageMaker Lakehouse stores metadata in AWS Glue Data Catalog and offers the same SLA as AWS Glue.
Getting started
How do I get started using SageMaker Lakehouse?
To get started, log in to your SageMaker domain on SageMaker Unified Studio using your corporate (for example, Okta) credentials. In a few short steps, administrators can create projects by choosing a specific project profile, and you can then choose a project to work with SageMaker Lakehouse. Once a project is selected, you get a unified view of data, query engines, and developer tools in one place.

Users such as data engineers and data analysts can then query the data using a tool of their choice. For example, when a data engineer uses a notebook and issues a Spark command to list tables, they discover all the data warehouse and data lake tables they have access to. They can then run commands to read and write data in tables that are physically stored either in Amazon S3 data lakes or Amazon Redshift data warehouses. Similarly, when a data analyst runs Amazon Redshift SQL commands from a SQL editor, they get the same unified view of data and can read and write data in these tables. From your preferred tool (SQL editor or notebook), you can create new tables in Amazon S3 or Amazon Redshift, and you can query Amazon Redshift materialized views to accelerate performance on your data lake tables.

In addition to SageMaker Unified Studio, SageMaker Lakehouse is also accessible from the AWS Management Console, AWS Glue APIs, the AWS Command Line Interface (AWS CLI), and AWS SDKs. For more details, visit the documentation page.
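The notebook flow described above might look like the following sketch, reusing the Iceberg-configured Spark session from the earlier example. All table and database names are placeholders.

```python
# Sketch of the notebook flow described above, reusing the Iceberg-
# configured Spark session from the earlier example. Names are placeholders.

# Discover the data warehouse and data lake tables you can access.
spark.sql("SHOW TABLES IN lakehouse.sales_db").show()

# Read from a table regardless of where it is physically stored.
orders = spark.table("lakehouse.sales_db.orders")

# Create a new table; its physical storage (Amazon S3 or Amazon Redshift
# Managed Storage) follows the catalog it is created in.
orders.filter("amount > 100").writeTo("lakehouse.sales_db.big_orders").create()
```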