Aggregation Pipelines on MongoDB

Why use MongoDB?
1. Dataset Used
Aggregation pipeline basics
A more complex example
Conclusion

Why use MongoDB?

MongoDB is a document oriented NOSQL open source database, which means that data does not necessarily have to follow a certain schema. This makes MongoDB an ideal candidate as a database for big data workloads as it ensures a better performance and partially gets rid of the worries created by enforced schemas or costly ALTER TABLE operations. So, how do we query MongoDB?

Dataset Used

MongoDB stores documents in a format called BSON (or binary JSON), each document is self-describing and includes it’s length and data types. The minimal data unit is a small key-value pair document. Throughout this article, we will be working on the following dataset:

db.restaurantOpinions.insertMany([
   {"email": "john@gmail.com", "responseDate": new Date("2020-03-04"), questions: [{"questionId": "6a4c", "answer": "Lasagne"}, {"question": "94fa", "answer": 5}]},
   {"email": "john@gmail.com", "responseDate": new Date("2020-06-09"), questions: [{"questionId": "6a4c", "answer": "Pizza"}, {"question": "94fa", "answer": 3}]},
   {"email": "jessie@gmail.com", "responseDate": new Date("2021-02-04"), questions: [{"questionId": "6a4c", "answer": "Chicken Salad"}, {"question": "94fa", "answer": 4}]},
   {"email": "jessie@gmail.com", "responseDate": new Date("2021-02-20"), questions: [{"questionId": "6a4c", "answer": "Chicken Salad"}, {"question": "94fa", "answer": 5}]},
   {"email": "edward@hotmail.com", "responseDate": new Date("2020-12-31"), questions: [{"questionId": "6a4c", "answer": "Roasted pork"}, {"question": "94fa", "answer": 5}]},
   {"email": "emily@yahoo.com", "responseDate": new Date("2020-12-31"), questions: [{"questionId": "6a4c", "answer": "Roasted pork"}, {"question": "94fa", "answer": 5}]}
])

As we can see, questions have an id that indicates more underlying data. In this case, the ids have the following meanings:

6a4c – What is your favorite meal?
94fa – How likely would you be to recommend this restaurant?

Aggregation pipeline basics

MongoDB query engine allows for quite complex aggregation pipelines to be executed right on the database. There are several stages that can be defined but before diving into the more complex pipelines, let’s take a look at some simple stages in order to get comfortable with the basics:

Let’s suppose we would like to know a hundred answers to the question about the customers favorite food for our most loyal customer, John, as well as the date he said those were his favorites. We could write the following query:

db.restaurantOpinions.aggregate([
   { $match: { email: "john@gmail.com" } },
   {
       $project: {
           "favoriteFood": { "$filter": {
               "input": "$questions",
               "as": "question",
               "cond": { "$eq": ["$$question.questionId", "6a4c"] }
           } },
           _id: false,
           responseDate: true
       }
   },
   { $limit: 100 }
]);

In this query, the following happens:

$match filters the number of totals results to those having John’s email. It is recommended to place filters as early as possible in the pipeline, as it reduces the number of rows to process. If it is placed at the beginning, it will use the database index as if it was a find query.
$project will create a new field named favoriteFoods which will only contain the question with id “6a4c”, which represents the favorite food. It will also include the responseDate in the final result and exclude the MongoDB generated id.
$limit will indicate that we only want to know a hundred (by any criteria, since it’s not sorted) of all results.

Yielding the following results:

favoriteFood	responseDate
`[{"questionId": "6a4c", "answer": "Lasagne"}]`	`2020-03-04T00:00:00.000Z`
`[{"questionId": "6a4c", "answer": "Pizza"}]`	`2020-06-09T00:00:00.000Z`

A more complex example

Let’s now suppose that we would like to figure out how likely each customer is, on average, to recommend this restaurant. This data is part of our questions, so we will have to create a row for each question in order to work with it. We could write the following query:

db.restaurantOpinions.aggregate([
  { $unwind: "$questions" },
  { $match: { "questions.questionId": "94fa" } },
  { $group: {
      _id: "$email",
      recomendationAvg: { $avg: "$questions.answer" }
  }},
  {
      $project: {
          email: "$_id",
          _id: false,
          recomendationAvg: true,
      }
  },
  { $sort: { recomendationAvg : 1 } }
]);

As we can see, it follows a similar structure to the example before, however, $unwind is allowing the rest of the stages to work without an array but instead a row for each of the questions. As before, let’s describe the query:

$unwind will create a row for each element found in the questions array.
$match filters all the questions, so only the ones with id 94fa are kept since the next stages are going to work with the answers to those questions.
$group creates a group for each email and calculates the average value of the answers in that group.
$project will set the _id column name back to email, since grouping returned a _id field and a recommendationAvg value.
$sort will take all results and order them by recommendationAvg based on an ascending criteria.

The query will yield the following results:

email	recomendationAvg
john@gmail.com	4
jessie@gmail.com	4.5
emily@yahoo.com	5
edward@hotmail.com	5

Conclusion

We have covered a few basic stages and pointed to more complex ones, however, there are a lot more stages and operators provided. $lookup deserves a special mention, since it allows the pipeline to perform a join with another collection while processing.

MongoDB aggregation pipelines are powerful and and a prefered method of aggregating data. With built in optimization, pipelines ease computing results and only need the user to be mindful of a few optimizations clearly mentioned in their documentation.

If you found this post useful, we recommend you to visit the Data Analytics category of our blog and share it with your contacts so they can also read it and give their opinion. See you in networks!

code, mongo, NoSQL, open source, pipelines, programming

Why use MongoDB?

Dataset Used

Aggregation pipeline basics

A more complex example

Conclusion

Víctor Prats