MongoDB Aggregation Pipeline

MongoDB Aggregation Pipeline

A Practical Guide

ยท

14 min read

Table of contents

Introduction

MongoDB's Aggregation Pipeline is a powerful tool for processing and transforming data within the database. It provides a flexible and expressive way to perform operations such as filtering, sorting, grouping, and projecting data through several stages. This guide aims to make the aggregation pipeline accessible and practical by addressing real-world scenarios and questions.

Setting Up Your Environment

To follow along with the examples in this guide, ensure you have a MongoDB Atlas account. If you don't have one, sign up for free here. Once logged in:

  1. Create a free cluster, name it 'AggregationPipeline' or a name of your choice.

  2. Create a database within your cluster.

  3. Allow network access from 0.0.0.0/0 for simplicity (ensure you use secure settings in production).

  4. Set up a username and password for database access.

Importing Sample Data

For this guide, we'll use a sample dataset. Download the Sample data Insert the data into your Atlas database:

  1. Create a collection named 'users' and insert the data in it.

  2. Create a collection named 'books' and insert the books' data.

  3. Create a collection named 'authors' and insert the authors' data.

  4. Before we dive into the MongoDB Aggregation Pipeline guide, take a moment to explore the data structures in the 'data,' 'books,' and 'authors' collections. Open each collection to inspect the JSON objects, observing different properties and their corresponding values within each document.

at this point your Atlas look like this :

Atlas collection dashboard

Now, you're ready to follow along with the aggregation pipeline examples.

Performing Aggregation Pipelines on Atlas

You can directly perform aggregation pipelines on MongoDB Atlas without downloading additional software. Here's how:

  1. In your MongoDB Atlas dashboard, navigate to the 'Database' tab and select your database.

  2. Click on the corresponding collection on which you want to perform aggregation.

  3. Once in the collection view, go to the 'Aggregation Pipeline' tab.

  4. Select text to write the pipelines in CLI way, u can also do this using GUI ( Stages Option ).

  5. You'll see an empty array; start writing your aggregation pipeline code directly in this field using JSON-like objects.

Aggregation pipeline tab

Now that you've set up your environment and are ready to solve real-world queries using MongoDB Aggregation Pipeline.

Question 1: How many users are active?

Let's break down the first question and its solution step by step:

Problem Statement:

We want to determine the number of active users in our database.

Aggregation Pipeline:

[
  // Stage 01: Filtering the active users
  {
    $match: {
      isActive: true,
    },
  },
  // Stage 02: Counting the active users
  {
    $count: 'activeUsers',
  },
]

Explanation:

Stage 01 -$match: This stage filters documents based on the condition provided, selecting only users with the isActive field set to true.

Stage 02 -$count: The second stage counts the number of documents passing through the pipeline and stores the result in a new field named 'activeUsers.'

Side Note:

In our database, we have numerous users, and we want to apply these operations to a single document. To achieve this, we group the users into a single group. However, we don't want to group them by factors like gender or age. We use null as the _id field value to group the users into a single group without any specific criteria.

Now that you understand the solution to the first question, let's move on to more complex scenarios in the subsequent questions.

Question 2: What is the average age of all users?

Problem Statement:

Determine the average age of all users in the database.

Aggregation Pipeline:

[
  // Stage 01: Grouping the users into a single group
  {
    $group: {
      _id: null,
      averageAge: {
        $avg: '$age',
      },
    },
  },
]

Explanation:

Stage 01 -$group: In this stage, we group all users into a single group. The _id: null ensures that we're not grouping users based on any specific criteria. We then use the $avg operator to calculate the average age of all users and store the result in a new field called averageAge.

Now that we have the solution for the second question, let's explore more aspects of MongoDB's Aggregation Pipeline.

Question 3: List the top 5 most common favorite fruits among users.

Problem Statement:

Identify the top 5 most common favorite fruits among users.

Aggregation Pipeline:

[
  // Stage 01: Grouping the users by their favorite fruit
  {
    $group: {
      _id: '$favoriteFruit',
      myCount: {
        $sum: 1,
      },
    },
  },
  // Stage 02: Sorting the fruits by their count
  {
    $sort: {
      myCount: -1, // 1 for ascending, -1 for descending
    },
  },
  // Stage 03: Limiting the result to the top 5
  {
    $limit: 5,
  },
]

Explanation:

Stage 01 -$group: This stage groups users based on their favorite fruit. The $sum: 1 counts the occurrences of each fruit.

Stage 02 -$sort: The aggregation pipeline then sorts the fruits based on their count in descending order (myCount: -1), placing the most common fruits at the beginning.

Stage 03 -$limit: Finally, we limit the result to the top 5 fruits.

Now that we've identified the top 5 most common favorite fruits, let's explore more advanced concepts in the upcoming questions.

Question 4: Find the total numbers of males and females.

Problem Statement:

Determine the total number of males and females in the user database.

Aggregation Pipeline:

[
  // Stage 01: Grouping the users by their gender
  {
    $group: {
      _id: '$gender',
      // Stage 02: Counting the users in each group
      genderCount: {
        $sum: 1,
      },
    },
  },
]

Explanation:

Stage 01 -$group: In this stage, we group users based on their gender. The _id: '$gender' ensures separate counts for each gender.

Stage 02 -$sum: Within each group, we use $sum: 1 to count the number of users in that particular gender group.

Caution: It's advised not to use the count method to count the number of documents in a collection due to performance considerations. Here, we use $sum for a more efficient count based on the situation.

Question 5: Which country has the highest number of users?

Problem Statement:

Identify the country with the highest number of users.

Aggregation Pipeline:

[
  // Stage 01: Grouping the users by their country
  {
    $group: {
      _id: '$company.location.country',
      // Counting users in each group
      countryUserCount: {
        $sum: 1,
      },
    },
  },
  // Stage 02: Sorting the countries by their user count
  {
    $sort: {
      countryUserCount: -1,
    },
  },
  // Stage 03: Limiting the result to top 1 to find the highest number of users
  {
    $limit: 1,
  },
]

Explanation:

Stage 01 -$group: In this stage, users are grouped based on their country (_id: '$company.location.country'). We count the users in each group using $sum: 1.

Stage 02 -$sort: The aggregation pipeline then sorts the countries based on the user count in descending order (countryUserCount: -1), placing the country with the highest user count at the beginning.

Stage 03 -$limit: Finally, we limit the result to the top 1 to identify the country with the highest number of users.

Question 6: List all unique colors present in the collection.

Problem Statement:

Retrieve a list of all unique eye colors present in the user collection.

Aggregation Pipeline:

[
  // Stage 01: Grouping the users by their eye color
  {
    $group: {
      _id: '$eyeColor',
    },
  },
]

Explanation:

Stage 01 -$group: In this stage, users are grouped based on their eye color (_id: '$eyeColor'). This effectively collects all unique eye colors in the dataset.

Question 7: What is the average number of tags per user?

Problem Statement:

Determine the average number of tags each user has.

Aggregation Pipeline:

[
  // Stage 1: Unwind the tags array to get individual tags
  {
    $unwind: '$tags',
  },
  // Stage 2: Group by _id and count the number of tags
  {
    $group: {
      _id: '$_id',
      numberOfTags: {
        $sum: 1,
      },
    },
  },
  // Stage 3: Group by null and calculate the average number of tags
  {
    $group: {
      _id: null,
      averageNumberOfTags: {
        $avg: '$numberOfTags',
      },
    },
  },
]

Explanation:

Stage 1 -$unwind: This stage spreads the tags array into individual elements, allowing us to perform operations on each tag separately.

Stage 2 -$group: In this stage, we group the tags by the user's _id and count the number of tags for each user.

Stage 3 -$group: Finally, we group by null to calculate the average number of tags across all users. The $avg: '$numberOfTags' calculates the average.

Understanding the use of $unwind and multiple $group stages helps us solve complex problems like finding the average number of tags per user. Let's continue exploring more features of MongoDB's Aggregation Pipeline in the upcoming questions.

Question 8: How many users have "enim" as one of their tags?

Problem Statement:

Count the number of users who have "enim" as one of their tags.

Aggregation Pipeline:

[
  // Stage 1: Match users with the tag 'enim'
  {
    $match: {
      tags: 'enim',
    },
  },
  // Stage 2: Counting all users with 'enim' tags
  {
    $count: 'allUsersWithEnimTags',
  },
]

Explanation:

Stage 1 -$match: This stage filters users based on the condition that the tags field must contain the value 'enim'.

Stage 2 -$count: The second stage counts the number of users who satisfy the matching condition and stores the result in a field named 'allUsersWithEnimTags.'

This example demonstrates how to use the $match stage to filter documents based on a specific condition.

Question 9: What are the names and ages of the users who are inactive and have "velit" as a tag?

Problem Statement:

Retrieve the names and ages of users who are inactive and have "velit" as a tag.

Aggregation Pipeline:

[
  // Stage 1: Match users who are inactive and have "velit" as a tag
  {
    $match: {
      isActive: false,
      tags: 'velit', // AND condition within the same object
    },
  },
  // Stage 2: Project only the name and age fields
  {
    $project: {
      name: 1, // 1 means include
      age: 1,
    },
  },
]

Explanation:

Stage 1 -$match: This stage filters users based on two conditions: isActive must be false, and the tags field must contain the value 'velit'.

Stage 2 -$project: The second stage projects only the specified fields (name and age). The fields not mentioned in the projection are excluded.

Side Note: Projection, as used in $project, means including only the fields mentioned in the object and excluding the rest.

Understanding the use of $match and $project stages helps us create more refined queries.

Question 10: How many users have a phone number starting with +92 330..?

Problem Statement:

Count the number of users whose phone numbers start with +92 330.

Aggregation Pipeline:

[
  // Stage 1: Match users with phone numbers starting with +92 330
  {
    $match: {
      'company.phone': /^\+92 330/,
    },
  },
  // Stage 2: Counting users with the requested phone number format
  {
    $count: 'usersWithRequestedNumberFormat',
  },
]

Explanation:

Stage 1 -$match: This stage filters users based on the condition that the company.phone field must match the regular expression /^\+92 330/. The regular expression checks if the phone number starts with +92 330.

Stage 2 -$count: The second stage counts the number of users who satisfy the matching condition and stores the result in a field named 'usersWithRequestedNumberFormat.'

Side Note: ๐Ÿ˜œ By the way, I chose this specific format because it happens to be my own number, perhaps! ๐Ÿ˜…

In this scenario, you've learned how to embed a regular expression (/^\+92 330/) within the $match stage to filter documents based on a specific pattern.

Question 11: Who has registered the most recently?

Problem Statement:

Identify the user who has registered most recently. Also, show the top 2 most recently registered users with their names, date of registration, and favorite fruit.

Aggregation Pipeline:

[
  // Stage 1: Sorting users by registration date in descending order
  {
    $sort: {
      registered: -1,
    },
  },
  // Stage 2: Limiting the result to the top 2 most recently registered users
  {
    $limit: 2,
  },
  // Stage 3: Projecting specific fields - name, favoriteFruit, and registered
  {
    $project: {
      name: 1,
      favoriteFruit: 1,
      registered: 1,
    },
  },
]

Explanation:

Stage 1 -$sort: This stage sorts users based on the registered field in descending order (registered: -1), placing the most recently registered users at the beginning.

Stage 2 -$limit: The second stage limits the result to the top 2 most recently registered users.

Stage 3 -$project: In the final stage, we project specific fields (name, favoriteFruit, and registered) to display in the result.

Now, you've learned how to sort, limit, and project specific fields to find the most recently registered users.

Question 12: Categorize users by their favoriteFruit.

Problem Statement:

Group users based on their favorite fruit and create a catalog of users for each fruit.

Aggregation Pipeline:

[
  // Stage 1: Grouping users by their favoriteFruit
  {
    $group: {
      _id: '$favoriteFruit',
      myUsers: {
        $push: '$name',
      },
    },
  },
]

Explanation:

Stage 1 -$group: This stage groups users based on their favoriteFruit field (_id: '$favoriteFruit'). The $push: '$name' operation is used to create an array (myUsers) for each fruit and push the names of users with that favorite fruit into the array.

Tip: Catalogize means grouping the data by a specific field. In this case, we're catalogizing users based on their favorite fruit.

Question 13: How many users have 'ad' as the 2nd tag in their list of tags?

Problem Statement:

Count the number of users whose second tag in the list of tags is 'ad'.

Aggregation Pipeline:

[
  // Stage 1: Match users with 'ad' as the 2nd tag
  {
    $match: {
      'tags.1': 'ad',
    },
  },
  // Stage 2: Counting users with 'ad' as the 2nd tag
  {
    $count: '2ndTagMatched',
  },
]

Explanation:

Stage 1 -$match: This stage filters users based on the condition that the value at the 2nd position ('tags.1') of the tags array must be 'ad'.

Stage 2 -$count: The second stage counts the number of users who satisfy the matching condition and stores the result in a field named '2ndTagMatched.'

Notes: In this case, each object in the array has the "tags" in different positions, but we specifically want to check the 2nd position of the tags array.

Tip: Whenever we need to match the value of an array at a specific position, we should use the $match operator.

Question 14: Find all users who have both 'enim' and 'id' in their tags.

Problem Statement:

Retrieve all users who have both 'enim' and 'id' in their list of tags.

Aggregation Pipeline:

[
  // Stage 1: Match users with both 'enim' and 'id' in their tags
  {
    $match: {
      tags: {
        $all: ['enim', 'id'],
      },
    },
  },
]

Explanation:

Stage 1 -$match: This stage filters users based on the condition that the tags array must contain both 'enim' and 'id'. The $all operator is used to check if all specified values are present in the array.

Question 15: List all the companies located in the USA with their corresponding user count.

Problem Statement:

Provide a list of all companies located in the USA along with the count of users associated with each company.

Aggregation Pipeline:

[
  // Stage 1: Match companies located in the USA
  {
    $match: {
      'company.location.country': 'USA',
    },
  },
  // Stage 2: Grouping companies by their title and counting users for each company
  {
    $group: {
      _id: '$company.title',
      userCount: {
        $sum: 1,
      },
    },
  },
]

Explanation:

Stage 1 -$match: This stage filters companies based on the condition that the company.location.country must be 'USA'.

Stage 2 -$group: In this stage, companies are grouped by their title (_id: '$company.title'). The $sum: 1 operation is used to count the number of users for each company.

Question 16: Lookup with $addFields and $first

Problem Statement:

Perform a lookup operation with $addFields and $first to retrieve details from the 'authors' collection based on the 'author_id' field.

Aggregation Pipeline (1st Approach):

[
  // Stage 1: Lookup operation with authors collection
  {
    $lookup: {
      from: 'authors',
      localField: 'author_id',
      foreignField: '_id',
      as: 'author_details',
    },
  },
  // Stage 2: Using $addFields and $first to remove the array
  {
    $addFields: {
      author_details: {
        $first: '$author_details',
      },
    },
  },
]

Aggregation Pipeline (2nd Approach):

[
  // Stage 1: Lookup operation with authors collection
  {
    $lookup: {
      from: 'authors',
      localField: 'author_id',
      foreignField: '_id',
      as: 'author_details',
    },
  },
  // Stage 2: Using $addFields and $arrayElemAt to remove the array
  {
    $addFields: {
      author_details: {
        $arrayElemAt: ['$author_details', 0],
      },
    },
  },
]

Explanation:

Lookup Operation ($lookup): This stage performs a lookup operation with the 'authors' collection, matching documents based on the 'author_id' field.

Stage 2 ($addFields): Both approaches use $addFields to add a new field (author_details) to the document. The $first operator in the first approach and $arrayElemAt: ['$author_details', 0] in the second approach are used to extract the first element of the array, effectively removing the array.

These approaches make the result more readable and eliminate the unnecessary array when there is only one object in it.

Side Note: ๐Ÿ˜„ This is not just about making MongoDB happy; it's about keeping your frontend developers happy too! Say goodbye to unnecessary arrays that used to annoy front-end guys. Now your data is more readable and frontend-friendly! ๐ŸŽ‰

Conclusion:

Congratulations on completing this comprehensive guide on MongoDB Aggregation Pipeline! ๐ŸŽ‰ We've explored various scenarios, from counting users and calculating averages to filtering and categorizing data. The use of operators like $match, $group, $sort, and $lookup has been demystified with practical examples.

Connect and Explore

Feel free to connect with me on LinkedIn and check out the source code for all the pipelines, along with the sample data, in this Replit๐Ÿš€

For queries, article requests, or any other discussions, connect with me on GitHub.

Looking forward to seeing you in more exciting tech adventures! ๐Ÿ‘‹

Happy coding! ๐Ÿš€

ย