What is Data Preparation?
Data preparation is the process of transforming raw data so that it is suitable for further processing and analysis. Key steps include collecting, cleaning, and labeling raw data into a form suitable for machine learning (ML) algorithms, and then exploring and visualizing the data. Data preparation can take up to 80% of the time spent on an ML project, so using specialized data preparation tools is important to streamline the process.
What is the connection between ML and data preparation?
Data flows through organizations like never before, arriving from sources ranging from smartphones to smart cities, as both structured data and unstructured data (images, documents, geospatial data, and more). Unstructured data makes up 80% of data today. ML can analyze not just structured data but also discover patterns in unstructured data. ML is the process by which a computer learns to interpret data and make decisions and recommendations based on that data. During the learning process, and later when the model is used to make predictions, incorrect, biased, or incomplete data can result in inaccurate predictions.
Why is data preparation important for ML?
Data fuels ML. Harnessing this data to reinvent your business, while challenging, is imperative to staying relevant now and in the future. It is survival of the most informed: organizations that can put their data to work to make better, more informed decisions respond faster to the unexpected and uncover new opportunities. Data preparation, while tedious, is a prerequisite for building accurate ML models and analytics, and it is the most time-consuming part of an ML project. To minimize this time investment, data scientists can use tools that help automate data preparation in various ways.
How do you prepare your data?
Data preparation follows a series of steps that starts with collecting the right data, followed by cleaning, labeling, and then validation and visualization.
Collect data
Collecting data is the process of assembling all the data you need for ML. Data collection can be tedious because data resides in many data sources, including on laptops, in data warehouses, in the cloud, inside applications, and on devices. Finding ways to connect to different data sources can be challenging. Data volumes are also increasing exponentially, so there is a lot of data to search through. Additionally, data has vastly different formats and types depending on the source. For example, video data and tabular data are not easy to use together.
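As a minimal illustration of the format problem described above, the sketch below merges records from two hypothetical sources, a CSV export and a JSON payload, into one uniform list using only the Python standard library. The source contents and field names are invented for the example.

```python
import csv
import io
import json

# Hypothetical sources: a CSV export from one system and a JSON payload
# from another, combined into a single list of records for downstream steps.
csv_source = io.StringIO("id,amount\n1,9.99\n2,24.50\n")
json_source = '[{"id": 3, "amount": 5.75}, {"id": 4, "amount": 12.00}]'

records = []
for row in csv.DictReader(csv_source):
    # CSV fields arrive as strings, so cast them to match the JSON records
    records.append({"id": int(row["id"]), "amount": float(row["amount"])})
records.extend(json.loads(json_source))

print(len(records))          # 4 records from two differently formatted sources
print(records[0]["amount"])  # 9.99
```

Even in this tiny case, type casting is needed before the two sources line up; real pipelines face the same issue multiplied across many more sources and formats.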
Clean data
Cleaning data corrects errors and fills in missing data as a step to ensure data quality. After you have clean data, you will need to transform it into a consistent, readable format. This process can include changing field formats like dates and currency, modifying naming conventions, and correcting values and units of measure so they are consistent.
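The cleaning steps above can be sketched in a few lines of standard-library Python. The rows, date formats, and mean-fill strategy below are illustrative assumptions, not a prescription; filling missing values with the column mean is just one common approach.

```python
from datetime import datetime
from statistics import mean

# Hypothetical raw rows with inconsistent date formats, a currency symbol,
# and a missing amount (None).
raw = [
    {"date": "2023-01-15", "amount": "$10.00"},
    {"date": "01/20/2023", "amount": "$12.50"},
    {"date": "2023-02-01", "amount": None},
]

def parse_date(value):
    # Normalize both observed formats to ISO 8601 (YYYY-MM-DD)
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value}")

# Strip currency symbols and convert amounts to floats
amounts = [float(r["amount"].lstrip("$")) if r["amount"] else None for r in raw]
# Fill the missing amount with the mean of the known values (one common strategy)
fill = mean(a for a in amounts if a is not None)
cleaned = [
    {"date": parse_date(r["date"]), "amount": a if a is not None else round(fill, 2)}
    for r, a in zip(raw, amounts)
]

print(cleaned[1]["date"])    # 2023-01-20
print(cleaned[2]["amount"])  # 11.25
```

After this pass, every row shares one date format and one numeric amount type, which is exactly the "consistent, readable format" the step calls for.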
Label data
Data labeling is the process of identifying raw data (images, text files, videos, and so on) and adding one or more meaningful and informative labels to provide context so an ML model can learn from it. For example, labels might indicate if a photo contains a bird or car, which words were mentioned in an audio recording, or if an X-ray discovered an irregularity. Data labeling is required for various use cases, including computer vision, natural language processing, and speech recognition.
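To make the idea of attaching labels concrete, here is a toy sketch of programmatic text labeling that assigns a topic label based on keyword matches. The labels and keyword sets are invented for illustration; production labeling workflows typically rely on human annotators or managed labeling services rather than hand-written rules.

```python
# Illustrative keyword sets mapping topics to trigger words (assumptions,
# not a real taxonomy).
KEYWORDS = {
    "sports": {"game", "score", "team"},
    "weather": {"rain", "sunny", "forecast"},
}

def label_text(text):
    # Assign the first topic whose keyword set overlaps the document's words
    words = set(text.lower().split())
    for label, keys in KEYWORDS.items():
        if words & keys:
            return label
    return "unknown"

examples = [
    "The team won the game last night",
    "Rain is in the forecast for tomorrow",
    "Quarterly earnings beat expectations",
]
for doc in examples:
    print(label_text(doc))  # sports, weather, unknown
```

The resulting (document, label) pairs are what a supervised model actually trains on; the quality of those labels bounds the quality of the model.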
Validate and visualize
After data is cleaned and labeled, ML teams often explore the data to make sure it is correct and ready for ML. Visualizations like histograms, scatter plots, box and whisker plots, line plots, and bar charts are all useful tools to confirm data is correct. Visualizations also help data science teams complete exploratory data analysis. This process uses visualizations to discover patterns, spot anomalies, test a hypothesis, or check assumptions. Exploratory data analysis does not require formal modeling; instead, data science teams can use visualizations to decipher the data.
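A quick validation pass of this kind can be sketched with summary statistics and a text histogram, using only the standard library. The values below are invented and include one deliberate outlier to show what this check surfaces.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical numeric feature to validate before training; 95 is an outlier
values = [12, 15, 14, 13, 95, 14, 16, 13, 15, 14]

# A mean far from the median is a classic sign of an outlier or bad record
print(f"mean={mean(values):.1f} median={median(values):.1f}")  # mean=22.1 median=14.0

# A quick text histogram: bucket values into ranges of width 10
buckets = Counter((v // 10) * 10 for v in values)
for lo in sorted(buckets):
    print(f"{lo:3d}-{lo + 9:3d} | {'#' * buckets[lo]}")
```

Here the mean (22.1) sits well above the median (14.0), and the histogram shows a lone value in the 90-99 bucket: either a legitimate extreme or a data-entry error worth investigating before training.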
How can AWS help?
Amazon SageMaker data preparation tools help organizations gain insights from both structured and unstructured data. For instance, you can use Amazon SageMaker Data Wrangler to simplify structured data preparation with built-in data visualizations through a no-code visual interface. SageMaker Data Wrangler includes over 300 built-in data transformations, so you can quickly normalize, transform, and combine features without writing any code. You can also bring your own custom transformations in Python or Apache Spark, if you prefer. For unstructured data, you need large, high-quality labeled datasets. Using Amazon SageMaker Ground Truth Plus, you can build high-quality ML training datasets while reducing data labeling costs by up to 40%, without having to build labeling applications or manage a labeling workforce on your own.
For analysts or business users who prefer preparing data inside a notebook, you can visually browse, discover, and connect to Spark data processing environments running on Amazon EMR from your Amazon SageMaker Studio notebooks with a few clicks. Once connected, you can interactively query, explore, and visualize data, and run Spark jobs using the language of your choice (SQL, Python, or Scala) to build complete data preparation and ML workflows.