Data Wrangling: Selecting Data

In the world of data analytics, we often work with large datasets containing loads of information. Before we get into analyzing, it’s important to prepare our data by selecting only the relevant rows and columns, applying sampling techniques, and more.

Subsetting Rows and Columns

Subsetting simply means selecting a smaller part of the dataset by focusing on specific rows or columns. This makes our data more manageable and keeps the analysis focused on the problem at hand.

For example, imagine you run a chain of grocery stores and have a dataset containing sales information for all products across all your stores. If you’re only interested in analyzing the sales data of your fruits and vegetables, there’s no need to use the entire dataset.

  • Subsetting Columns: 
    To focus on the columns we’re interested in, we can select just the product name, category, and sales data columns. Doing this removes any irrelevant information, such as the aisle number or shelf location. Note that each column holds one variable; the variable the analysis centers on (in this case, fruit and vegetable sales) is sometimes called the target variable.
  • Subsetting Rows: 
    To continue refining our dataset, we can then subset rows based on the category. In our fruits and vegetables sales example, we’ll filter our dataset to only include rows with “Fruits” or “Vegetables” in the category column. Now, we have a dataset that only contains the data we need to analyze the sales of fruits and vegetables. Note that subsetting columns and rows leaves us with just the variables (columns) and observations (rows) the analysis requires.
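The two subsetting steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the records, field names (`product`, `category`, `aisle`, `shelf`, `sales`), and helper functions are all hypothetical, and a real project would more likely use a dataframe library.

```python
# Hypothetical sales records; field names and values are illustrative only.
sales = [
    {"product": "Apples", "category": "Fruits", "aisle": 1, "shelf": "A", "sales": 120.0},
    {"product": "Carrots", "category": "Vegetables", "aisle": 2, "shelf": "B", "sales": 80.5},
    {"product": "Bread", "category": "Bakery", "aisle": 5, "shelf": "C", "sales": 60.0},
    {"product": "Milk", "category": "Dairy", "aisle": 7, "shelf": "A", "sales": 95.25},
]

KEEP_COLUMNS = ("product", "category", "sales")
KEEP_CATEGORIES = {"Fruits", "Vegetables"}


def subset_columns(rows, columns):
    """Keep only the named columns in every row."""
    return [{col: row[col] for col in columns} for row in rows]


def subset_rows(rows, categories):
    """Keep only the rows whose category is in the given set."""
    return [row for row in rows if row["category"] in categories]


# Drop the aisle and shelf columns, then keep only the produce rows.
produce_sales = subset_rows(subset_columns(sales, KEEP_COLUMNS), KEEP_CATEGORIES)

for row in produce_sales:
    print(row)
```

The order of the two steps does not matter here, as long as the category column is still present when the rows are filtered.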

 

Sampling

In some cases, our dataset might be too large to work with or analyze as a whole. Sampling is an important technique that allows us to work with a smaller, representative portion of the data without losing the overall structure and patterns.

Let’s say that in addition to the sales data for our grocery stores, we also have access to a vast pool of customer feedback. Instead of analyzing every single piece of feedback, we can use sampling to select a representative subset of customer comments to analyze. 

Here are a few common sampling methods: 

  • Random sampling involves randomly selecting a smaller group of data points, ensuring each has an equal chance of being chosen. Because selection is left to chance, this method guards against selection bias, and with a large enough sample it tends to be representative of the whole population. 
  • In stratified sampling, we divide our dataset into different subgroups, or strata, based on a specific characteristic (e.g., age, gender, or location). We then randomly sample from each stratum to ensure that our final sample includes a proportional representation of each subgroup. 
  • Cluster sampling works by dividing the data into clusters (often naturally occurring groups, such as stores or regions) and then selecting a number of clusters at random; every record in the chosen clusters, or a random sample of them, is then analyzed. This method is particularly useful when our data is spread across multiple locations, as it can reduce the time and resources needed to collect the data compared to simple random sampling.
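As a rough sketch of how the three methods differ, the snippet below applies each one to a made-up pool of customer feedback using only the Python standard library. The records, the `store` field used as the stratum and cluster key, the sample sizes, and the helper names are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

# Hypothetical feedback pool: 100 comments spread evenly over 4 stores.
feedback = [
    {"id": i, "store": f"Store {i % 4 + 1}", "comment": f"comment {i}"}
    for i in range(100)
]

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def group_by(rows, key):
    """Group rows into lists keyed by the value of one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def simple_random_sample(rows, n, rng):
    """Pick n rows; every row has the same chance of selection."""
    return rng.sample(rows, n)


def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from each stratum (at least one row each)."""
    sample = []
    for members in group_by(rows, key).values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(rows, key, n_clusters, rng):
    """Pick whole clusters at random and keep every row in them."""
    groups = group_by(rows, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [row for name in chosen for row in groups[name]]


random_sample = simple_random_sample(feedback, 20, rng)
strat_sample = stratified_sample(feedback, "store", 0.2, rng)
clust_sample = cluster_sample(feedback, "store", 2, rng)
```

In this sketch the stratified sample always contains comments from every store, while the cluster sample contains all comments from just two stores.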

 

Handling Missing Data

In real-world datasets, missing data is a common issue. When selecting data, we should also be aware of any gaps in our dataset and decide how to handle them. 

There are several ways to handle missing data, such as: 

  • Deleting rows or columns with missing data.
  • Filling in the missing data with an estimate, like the average or median value of the available data. 
  • Filling in the missing data with a value predicted by a model, such as linear regression or another machine learning technique.

 

In our grocery store example, we might find some sales data missing for certain days or locations. Depending on the situation, we can decide how to handle this missing data to avoid adversely affecting our analysis.
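The first two options, deletion and filling with an average or median, might look like this in plain Python. The daily sales figures, the use of `None` to mark a gap, and the `fill_missing` helper are hypothetical; model-based filling is left out to keep the sketch short.

```python
from statistics import mean, median

# Hypothetical daily sales for one store; None marks a missing value.
daily_sales = [
    {"day": "Mon", "sales": 200.0},
    {"day": "Tue", "sales": None},
    {"day": "Wed", "sales": 260.0},
    {"day": "Thu", "sales": 240.0},
    {"day": "Fri", "sales": None},
    {"day": "Sat", "sales": 500.0},
]

# Values that were actually recorded.
observed = [row["sales"] for row in daily_sales if row["sales"] is not None]

# Option 1: delete the rows with missing sales.
dropped = [row for row in daily_sales if row["sales"] is not None]


def fill_missing(rows, value):
    """Return copies of the rows with missing sales replaced by value."""
    return [
        {**row, "sales": value if row["sales"] is None else row["sales"]}
        for row in rows
    ]


# Option 2: fill the gaps with the mean or the median of the observed values.
mean_filled = fill_missing(daily_sales, mean(observed))
median_filled = fill_missing(daily_sales, median(observed))
```

Here the unusually high Saturday figure pulls the mean above the median, which is one reason the median is often preferred when the data contains outliers.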

