« Back to home

Notes on MongoDB


  • MongoDB is document oriented.
  • MongoDB is schemaless.
  • No joins, no transactions (scales horizontally)

Unlike in a relational database, you are not constrained to follow any particular schema. Do any:

  • Begin inserting documents with the new schema.
  • Perform a bulk update on the existing documents.
  • Begin updating old documents to the new schema one by one at an appropriate event
    (such as getting read from or written to), as coded in the application.

MongoDB does not support joins as a design decision because they do not scale horizontally, and it does not support SQL because that query language was built around joins and transactions

Building an App

Create a collection

use test

and save stuff to it (save will overwrite if an _id is given)

db.things.save({ a: 1, b : 1, fruit: ["apple", "orange", "pear"]})
db.mycol.save({_id: 1, title: "overview", url:'google.com', tags: [], likes:200 })
db.mycol.insert({title: "overview", url:'google.com', tags: ['uno', 'dos'], likes:100 })

delete records

db.mycol.remove({_id: 1})

filtering (&& is comma, || is $or)

db.mycol.find({"likes":{$ne:101}, title: "overview"})
db.mycol.find({or: ["likes": {$gt:50},{"_id":{$nq: null}} ]})
db.mycol.find({"title": "overview", $or: [{"likes": {$gt:50}}, {"_id": {$ne: null}} ]})

other find functions


Add anything to a collection as far is JSON (no quotes is OK)

{ "_id" : ObjectId("5567d9251f523c3d2f2fc579"), "a" : 1, "b" : 2, "c" : 3 }
{ "_id" : ObjectId("5567d9571f523c3d2f2fc57a"), "a" : 3, "b" : 4, "c" : 3, "d" : 5 }
{ "_id" : ObjectId("5567dc45fd1f5b47c9886a05"), "a" : 1, "b" : 1, "fruit" : [ "apple", "orange", "pear" ] }

Better output

{ "_id" : ObjectId("5567d9251f523c3d2f2fc579"), "a" : 1, "b" : 2, "c" : 3 }
        "_id" : ObjectId("5567d9571f523c3d2f2fc57a"),
        "a" : 3,
        "b" : 4,
        "c" : 3,
        "d" : 5
        "_id" : ObjectId("5567dc45fd1f5b47c9886a05"),
        "a" : 1,
        "b" : 1,
        "fruit" : [
        "_id" : ObjectId("5567dd54bed3472b71fbf3fe"),
        "name" : "andrew",
        "address" : {
                "street" : "elm drive",
                "city" : "Palo Alto",
                "zipcode" : "94126",
                "houser_number" : 34

These are all valid JSON inputs

{ a : 1, b : 2, c : 3 }
{ a : 1, b : 2, c : [ 1, 2, 3, 4, 5 ] }
{ a : 1, b : { }, c : [ { a : 1, b : 2 }, 5, 6 ] }
{ } 

Installing on Debian

Based on the official documentation mongodb.org

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
echo "deb http://repo.mongodb.org/apt/debian "$(lsb_release -sc)"/mongodb-org/3.0 main" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list

if not pre-installed

sudo apt-get install lsb-release

Install latest stable release

sudo apt-get install -y mongodb-org

Fire up mongod

sudo service mongod start


May need to remove lock data file

rm  /var/lib/mongodb/mongod.lock

May need to set locales etc.

export LC_ALL=C
export LC_CTYPE="UTF-8"
export LANG="en_US.UTF-8"

About the 32-bit implementation

A note about the 32-bit implementation of MongoDB, if you're curious: It can safely handle up to around 2 GB of data, but if you do the math, 2^32 bytes = 4.3 gigabytes of data. Why the discrepancy? Well, the 4.3 GB is a good predictor of the largest total size of the data files you can get, but in fact, it is unlikely that a production implementation would be able to use even that much.

The 4.3 GB limit is true when we talk about the size of the data files, but often MongoDB doesn't use space 100% efficiently. It trades some space for some speed. As we add documents, we also (by default) add "padding", or empty space into which MongoDB can grow.

In MongoDB 3.0, MongoDB uses power of 2 sizes record allocation strategy by default. The record size is a power of 2 (32 bytes, 64 bytes, 128 bytes, etc). Any document created will fill the smallest such record that can contain it. This makes for fewer moves, and a good likelihood that a record will fit the space vacated by another record. Overall database performance improves, but at the cost to disk space. Plus there is other metadata to consider, such as indexes and the operations log. Add it all together, and it's very likely that, somewhere above 2 gigabytes but below 4 GB, a 32-bit implementation of MongoDB will run out of space. This is why it shouldn't be used in production.

CRUD operations

insert, find, update, delete:


{"_id": ObjectId("2883..."), "name": "john", "age": 10}
{"_id": ObjectId("2312..."), "name": "maria", "age": 10}

db.people.findOne({"age": 10})

{"_id": ObjectId("2883..."), "name": "john", "age": 10}

db.people.findOne({"age": 10}, {"_id": false, "name": true, "age": false})
{"name": "john"}


multiple fiels in find, equivalent to &&

db.scores.find({"_id:" 19, "type": "essay"} )

using $gt and $lt, $gte, $lte

db.scores.find({"scores": { $gt: 95, $lte: 98},"type": "essay"})

inequalities on strings are resolved lexicographically

db.scores.find({"name": { $gte: "G"}} 

operators exists, types, and regex

db.people.find({profession: {$exists : true}})
db.people.find({name: {$type : 2}}) //2 = strings, see BSON docs
db.people.find({name: {$regex : "e$", $regex: "^A"}}) //starts with A and last letter is e

operator $or

db.people.find({ $or: [{name: "john"}, {age : 10}]})

operator $and (infrequent)

db.people.find({ $and: [{name: "john"}, {age : 10}]})

Note: for the below example the second condition overwrites the first one, not AND operation

db.scores.find({score: {$gt : 50}, {score: {$lt : 60}})

querying inside arrays

db.people.find({classes: "maths"})
{"_id": ObjectId("2312..."), "name": "maria", "age": 10, "classes": ["maths", "physEd"]}

No recursive depth in case of nested documents!

using $in and $all (arrays)

// includes all of them
db.people.find({favorites: { $all: ["maths", "physEd"]}})
{"_id": ObjectId("2312..."), "name": "maria", "age": 10, "classes": ["maths", "physEd"]}
{"_id": ObjectId("2312..."), "name": "mario", "age": 12, "classes": ["physEd","maths", "art"]}

// includes one or the other
db.people.find({favorites: { $in: ["art", "maths"]}})
{"_id": ObjectId("2312..."), "name": "maria", "age": 10, "classes": ["maths", "physEd"]}
{"_id": ObjectId("2312..."), "name": "mario", "age": 12, "classes": ["physEd","maths", "art"]}

dot notation

// must match strict order
db.people.find({"phone": {
			"work": "12355606451",
			"home": "12878783556"

	"name": "pete",
	"phone": {
			"work": "12355606451",
			"home": "12878783556"

# look inside nested documents using dot
db.people.find({"phone.work": "work": "12355606451"});

querying, cursors

// hold that cursor
cur = db.people.find(); null;

// methods
//check for end of cursor

//step through

//set cursor step size with next()
cur.limit(5); null

cur.sort({name: -1}); null

cur.sort({name: -1}).limit(5); null

actually processed in db after used; null is just a shell hack

//skip the first 2 elements
cur.sort({name: -1}).limit(5).skip(2); null


db.people.count({age: 10})


A single element update

{"_id": ObjectId("2312..."), "name": "mario", "age": 12, "classes": ["physEd","maths", "art"]}

db.people.update({name: "mario"}, {age: 20})

{"_id": ObjectId("2312..."), "name": "mario", "age": 20}

IMPORTANT: update replaces the whole record, fields not mapped will be lost!

using $set, $inc

db.people.update({name: "mario"}, {$set: {age: 20}})
{"_id": ObjectId("2312..."), "name": "mario", "age": 20, "classes": ["physEd","maths", "art"]}

db.people.update({name: "mario"}, {$inc: {age: 5}})
{"_id": ObjectId("2312..."), "name": "mario", "age": 25, "classes": ["physEd","maths", "art"]}

If the field doesn't exists, it'll be created

using unset

db.people.update({name: "mario"}, {$unset: {classes: 1}})
{"_id": ObjectId("2312..."), "name": "mario", "age": 20}

the value used 1, is arbitrary and is actually ignored

Array manipulation with $push, $pop, $pull, $pushAll, $pullAll, $addToSet

db.arrays.insert(_id: 10, a: [1,2,3,4])
db.arrays.update({_id: 10}, {$set {"a.2": 33}})

db.arrays.update({_id: 10}, {$push{"a": 55}})

//removes right most element
db.arrays.update({_id: 10}, {$pop{"a": 1}})
//removes left most element
db.arrays.update({_id: 10}, {$pop{"a": -1}})

db.arrays.update({_id: 10}, {$pushAll{"a": [55, 66, 77]}})

//removes the value, regardless its position
db.arrays.update({_id: 10}, {$pull{"a": 5}})
db.arrays.update({_id: 10}, {$pull{"a": [2, 5, 6]}})

// treat array as a set; subsequent calls have no effect
db.arrays.update({_id: 10}, {$addToSet {"a": 5}})


Updates and existing record, otherwise create it (to use only with concrete values)

db.people.update({_id: 10}, {$set : {age: 10}}, {upsert: true})	


Must pass option: multi, {} matches all elements
default behaviour is to update only ONE element in the collection

db.people.update({}, { $set: {surname: "doe"}}, {multi: true})

Single thread for each operation, might yield to other write operations.

e.g. Update all records with score less than 70 and add 20 points

db.scores.update({'score': {$lt: 70}}, {$inc: {'score': 20}}, {multi: true})


db.people.remove({name: "alex"})

remove all

// more efficient plus removes metadata

Mongo Shell

  • Essentially, a javascript runtime

    • evaluates javascript, well-documented and easy to pickup
  • Autocompletion

  • Emacs key bindings: see help keys

  • Object in javascript z ={a: 1} can be accessed by
    z.a == z["a"] == (w="a"; z[w])

  • Rich in object types. See BSON specs


d = New Date()
ISODate("2012-10-11 7:14:12:23.788")
obj = {date: d}
  • Strings must be encodes UTF-8
  • ObjectId type can be used as identifier


Install with pip, .e.g,

 pip install https://github.com/mongodb/mongo-python-driver/archive/3.0b1.tar.gz 

Minimal example

import pymongo
from pymongo import MongoClient

connection = MongoClient('localhost', 27017)
db = connection.test
names = db.names
item = names.find_one()
print item['name']

load web data

Get some data

wget http://www.reddit.com/r/technology/json > reddit.json

cat reddit.json| python -m json.tool| more

In python, user urllib2, get_reddit_data.py

import json
import urllib2
import pymongo

connection = pymongo.MongoClient("mongodb://localhost")
stories = db.stories

reddit_page = urllib2.urlopen("http://www.reddit.com/r/technology")
parsed = json.loads(reddit_page.read())

for item in parsed['data']['children']:


Finding regit post from Apple and Google

import pymongo
connection = pymongo.MongoClient("mongodb://localhost")
stories = db.stories

def find():
	print "find, stories for apple and google"
	q = {'title':{'$regex': 'apple|google', '$options': 'i'}}
	p = {'title':1, '_id': 0}
		cursor = stories.find(q, p)
	except Exception as e:
		print "Unexpected error: ", type(e), e
	for doc in cursor
		print doc

sort, skip and limit

Always happens in this order, regardless the composition order


e.g. get 5th element from last (); same results

cursor = scores.find(query).sort('s_id', pymongo.ASCENDING).skip(4).limit(1)

cursor2 = scores.find(query).skip(4)
cursor2 = cursor2.limit(1)
cursor2 = cursor2.sort('s_id', pymongo.ASCENDING)

for doc in cursor
	print doc

for doc2 in cursor2
	print doc2

complex sorting

scores.sort([('s_id', pymongo.ASCENDING), ('score', pymongo.ASCENDING)])

Insert, update, deleting

pymongo vs server

  • Insert

    • insert_one vs. insert
    • insert_many vs. bulk
  • Update

    • update_one vs. update
    • update_many vs. update
    • replace_one vs. update
  • Delete

    • delete_one vs. remove
    • delete_many


_id must be unique or autofilled

people.insert_one({'_id': 1,  'name':'mark'})


the second argument ordered, makes insertion in strict order. If one fails, subsequent will fail

people.insert_many(collection_insert, ordered=True)	

insert_ordered default is True

update_one, update_many

update_one + filter, updates($operator)
update_many + filter, updates($operator)

	res = scores.update_one({'id': my_id}, 
					{'$set': {'timestamp' : datetime.datetime.utcnow() }}	)

	print res.matched_count

implemented with the same underlying function


updates entire document. Useful by:

  1. Pulling all documennt
  2. Making changes
  3. saving back to the document

set null all timestamps from all records

record  = scores.findOne({...})

// vulnerable block

id = record['_id'];
record['timestamp'] = null

// end block
scores.replace_one({'_id': id} record)

all updates and replace uses udpdate command under the hood

update_one or update_many, you must specify a $operator of some sort.

if you use replace_one, no $operator needed


if match
-> update

things.update_one ( {..}, {..}, upsert=True)
things.update_many( {..}, {..}, upsert=True)
things.replace_one( {..}, {..}, upsert=True)

NOTE: for replace_one upsert only inserts second argument object, not filter


{_id:"bat", friend:'ball', cousin:'glove'} // equivalent

stuff.update_one({'_id':'bat'}, {'$set':{'friend':'ball', 'cousin':'glove'}}, upsert=True)

stuff.update_one({'_id':'bat'}, {'$set':{'_id':'bat', 'friend':'ball', 'cousin':'glove'}}, upsert=True)

stuff.replace_one({'_id':'bat'}, {'friend':'ball', 'cousin':'glove'}, upsert=True)


res = scores.delete_one({..})

res = scores.delete_many({..})

find_one_and _***

Combining the document state before or after updates

// gets the document AFTER the update

counter = find_one_and_update(	filter={'type': 'uid'},
							 	update={'$inc': {'value': 1}},

Actions are atomic, you can make changes and get the value in one operation

Use find_and_modify under the hood


Find Student with scores greater or equal to 65 in ascending order

db.grades.find({score: {$gte: 65}}).sort({score: 1 })

	{'$group':{'_id':'$student_id', 'average':{$avg:'$score'}}}, 

Delete lowest homework score for each student

  1. filter homeworks
  2. sort student, then sort score
  3. delete when record student_id changes
db.grades.find({"type": "homework"})
		.sort({"student_id": 1, "score": 1})

using the zip codes dataset

Where each record looks like:

	"_id" : "35004",
	"city" : "ACMAR",
	"loc" : [
	"pop" : 6055,
	"state" : "AL"

and some cities have multiple zip codes, i.e., city field is not unique

Sum population of all cities where the city name is a digit

    {"$match": {"city": {$regex: '[0-9]+'}}},
   	{"$group": {"_id": null, "total": {$sum: "$pop"}}}

Or you could operate on the string using:

     	"city": "$city",
		"first_char": {"$substr" : ["$city",0,1]},
		"pop": "$pop"


comments powered by Disqus