MongoDB

Intro

  • MongoDB is document oriented.
  • MongoDB is schemaless.
  • No joins, no transactions (scales horizontally)

Unlike in a relational database, you are not constrained to follow any particular schema. Do any:

  • Begin inserting documents with the new schema.
  • Perform a bulk update on the existing documents.
  • Begin updating old documents to the new schema one by one at an appropriate event
    (such as getting read from or written to), as coded in the application.

MongoDB does not support joins as a design decision because they do not scale horizontally, and it does not support SQL because that query language was built around joins and transactions

Building an App

Create a collection

use test

and save stuff to it (save will overwrite if an _id is given)

db.things.save({ a: 1, b : 1, fruit: ["apple", "orange", "pear"]})
db.mycol.save({_id: 1, title: "overview", url:'google.com', tags: [], likes:200 })
db.mycol.insert({title: "overview", url:'google.com', tags: ['uno', 'dos'], likes:100 })

delete records

db.mycol.remove({_id: 1})

filtering (&& is comma, || is $or)

db.mycol.find({"likes":100})
db.mycol.find({"likes":{$ne:101}, title: "overview"})
db.mycol.find({or: ["likes": {$gt:50},{"_id":{$nq: null}} ]})
db.mycol.find({"title": "overview", $or: [{"likes": {$gt:50}}, {"_id": {$ne: null}} ]})

other find functions

db.mycol.findOne()

Add anything to a collection as far is JSON (no quotes is OK)

db.things.find()
{ "_id" : ObjectId("5567d9251f523c3d2f2fc579"), "a" : 1, "b" : 2, "c" : 3 }
{ "_id" : ObjectId("5567d9571f523c3d2f2fc57a"), "a" : 3, "b" : 4, "c" : 3, "d" : 5 }
{ "_id" : ObjectId("5567dc45fd1f5b47c9886a05"), "a" : 1, "b" : 1, "fruit" : [ "apple", "orange", "pear" ] }

Better output

 db.things.find().pretty()
{ "_id" : ObjectId("5567d9251f523c3d2f2fc579"), "a" : 1, "b" : 2, "c" : 3 }
{
        "_id" : ObjectId("5567d9571f523c3d2f2fc57a"),
        "a" : 3,
        "b" : 4,
        "c" : 3,
        "d" : 5
}
{
        "_id" : ObjectId("5567dc45fd1f5b47c9886a05"),
        "a" : 1,
        "b" : 1,
        "fruit" : [
                "apple",
                "orange",
                "pear"
        ]
}
{
        "_id" : ObjectId("5567dd54bed3472b71fbf3fe"),
        "name" : "andrew",
        "address" : {
                "street" : "elm drive",
                "city" : "Palo Alto",
                "zipcode" : "94126",
                "houser_number" : 34
        }
}

These are all valid JSON inputs

{ a : 1, b : 2, c : 3 }
{ a : 1, b : 2, c : [ 1, 2, 3, 4, 5 ] }
{ a : 1, b : { }, c : [ { a : 1, b : 2 }, 5, 6 ] }
{ } 

Installing on Debian

Based on the official documentation mongodb.org

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
echo "deb http://repo.mongodb.org/apt/debian "$(lsb_release -sc)"/mongodb-org/3.0 main" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list

if not pre-installed

sudo apt-get install lsb-release

Install latest stable release

sudo apt-get install -y mongodb-org

Fire up mongod

sudo service mongod start

Troubleshooting

May need to remove lock data file

rm  /var/lib/mongodb/mongod.lock

May need to set locales etc.

export LC_ALL=C
export LC_CTYPE="UTF-8"
export LANG="en_US.UTF-8"

About the 32-bit implementation

A note about the 32-bit implementation of MongoDB, if you're curious: It can safely handle up to around 2 GB of data, but if you do the math, 2^32 bytes = 4.3 gigabytes of data. Why the discrepancy? Well, the 4.3 GB is a good predictor of the largest total size of the data files you can get, but in fact, it is unlikely that a production implementation would be able to use even that much.
The 4.3 GB limit is true when we talk about the size of the data files, but often MongoDB doesn't use space 100% efficiently. It trades some space for some speed. As we add documents, we also (by default) add "padding", or empty space into which MongoDB can grow.
In MongoDB 3.0, MongoDB uses power of 2 sizes record allocation strategy by default. The record size is a power of 2 (32 bytes, 64 bytes, 128 bytes, etc). Any document created will fill the smallest such record that can contain it. This makes for fewer moves, and a good likelihood that a record will fit the space vacated by another record. Overall database performance improves, but at the cost to disk space. Plus there is other metadata to consider, such as indexes and the operations log. Add it all together, and it's very likely that, somewhere above 2 gigabytes but below 4 GB, a 32-bit implementation of MongoDB will run out of space. This is why it shouldn't be used in production.

CRUD operations

insert, find, update, delete:

db.people.insert(john)
db.people.insert(maria)
db.people.find()

{"_id": ObjectId("2883..."), "name": "john", "age": 10}
{"_id": ObjectId("2312..."), "name": "maria", "age": 10}

db.people.findOne({"age": 10})

{"_id": ObjectId("2883..."), "name": "john", "age": 10}

db.people.findOne({"age": 10}, {"_id": false, "name": true, "age": false})
{"name": "john"}

Querying

multiple fiels in find, equivalent to &&

db.scores.find({"_id:" 19, "type": "essay"} )

using $gt and $lt, $gte, $lte

db.scores.find({"scores": { $gt: 95, $lte: 98},"type": "essay"})

inequalities on strings are resolved lexicographically

db.scores.find({"name": { $gte: "G"}} 
Gael
Gabriel
...

operators exists, types, and regex

db.people.find({profession: {$exists : true}})
db.people.find({name: {$type : 2}}) //2 = strings, see BSON docs
db.people.find({name: {$regex : "e$", $regex: "^A"}}) //starts with A and last letter is e

operator $or

db.people.find({ $or: [{name: "john"}, {age : 10}]})

operator $and (infrequent)

db.people.find({ $and: [{name: "john"}, {age : 10}]})

Note: for the below example the second condition overwrites the first one, not AND operation

db.scores.find({score: {$gt : 50}, {score: {$lt : 60}})

querying inside arrays

db.people.find({classes: "maths"})
{"_id": ObjectId("2312..."), "name": "maria", "age": 10, "classes": ["maths", "physEd"]}

No recursive depth in case of nested documents!

using $in and $all (arrays)

// includes all of them
db.people.find({favorites: { $all: ["maths", "physEd"]}})
{"_id": ObjectId("2312..."), "name": "maria", "age": 10, "classes": ["maths", "physEd"]}
{"_id": ObjectId("2312..."), "name": "mario", "age": 12, "classes": ["physEd","maths", "art"]}


// includes one or the other
db.people.find({favorites: { $in: ["art", "maths"]}})
{"_id": ObjectId("2312..."), "name": "maria", "age": 10, "classes": ["maths", "physEd"]}
{"_id": ObjectId("2312..."), "name": "mario", "age": 12, "classes": ["physEd","maths", "art"]}

dot notation

// must match strict order
db.people.find({"phone": {
			"work": "12355606451",
			"home": "12878783556"
	}});

{
	ObjectId("2312..."),
	"name": "pete",
	"phone": {
			"work": "12355606451",
			"home": "12878783556"
	}
}

# look inside nested documents using dot
db.people.find({"phone.work": "work": "12355606451"});

querying, cursors

// hold that cursor
cur = db.people.find(); null;

// methods
//check for end of cursor
cur.hasNext()

//step through
cur.next()

//set cursor step size with next()
cur.limit(5); null

//sort
cur.sort({name: -1}); null

//chaining
cur.sort({name: -1}).limit(5); null

actually processed in db after used; null is just a shell hack

//skip the first 2 elements
cur.sort({name: -1}).limit(5).skip(2); null

counting

db.people.count({age: 10})

Updating

A single element update

{"_id": ObjectId("2312..."), "name": "mario", "age": 12, "classes": ["physEd","maths", "art"]}

db.people.update({name: "mario"}, {age: 20})

{"_id": ObjectId("2312..."), "name": "mario", "age": 20}

IMPORTANT: update replaces the whole record, fields not mapped will be lost!

using $set, $inc

db.people.update({name: "mario"}, {$set: {age: 20}})
{"_id": ObjectId("2312..."), "name": "mario", "age": 20, "classes": ["physEd","maths", "art"]}

db.people.update({name: "mario"}, {$inc: {age: 5}})
{"_id": ObjectId("2312..."), "name": "mario", "age": 25, "classes": ["physEd","maths", "art"]}

If the field doesn't exists, it'll be created

using unset

db.people.update({name: "mario"}, {$unset: {classes: 1}})
{"_id": ObjectId("2312..."), "name": "mario", "age": 20}

the value used 1, is arbitrary and is actually ignored

Array manipulation with $push, $pop, $pull, $pushAll, $pullAll, $addToSet

db.arrays.insert(_id: 10, a: [1,2,3,4])
db.arrays.update({_id: 10}, {$set {"a.2": 33}})

db.arrays.update({_id: 10}, {$push{"a": 55}})

//removes right most element
db.arrays.update({_id: 10}, {$pop{"a": 1}})
//removes left most element
db.arrays.update({_id: 10}, {$pop{"a": -1}})

db.arrays.update({_id: 10}, {$pushAll{"a": [55, 66, 77]}})

//removes the value, regardless its position
db.arrays.update({_id: 10}, {$pull{"a": 5}})
db.arrays.update({_id: 10}, {$pull{"a": [2, 5, 6]}})

// treat array as a set; subsequent calls have no effect
db.arrays.update({_id: 10}, {$addToSet {"a": 5}})

Upserts

Updates and existing record, otherwise create it (to use only with concrete values)

db.people.update({_id: 10}, {$set : {age: 10}}, {upsert: true})	

Multiupdate

Must pass option: multi, {} matches all elements
default behaviour is to update only ONE element in the collection

db.people.update({}, { $set: {surname: "doe"}}, {multi: true})

Single thread for each operation, might yield to other write operations.

e.g. Update all records with score less than 70 and add 20 points

db.scores.update({'score': {$lt: 70}}, {$inc: {'score': 20}}, {multi: true})

Removing

db.people.remove({name: "alex"})

remove all

db.people.remove({})
// more efficient plus removes metadata
db.people.drop();

Mongo Shell

Essentially, a javascript runtime

  • evaluates javascript, well-documented  and easy to pickup

Autocompletion

Emacs key bindings: see  help keys

Object in javascript z ={a: 1} can be accessed by
z.a == z["a"] == (w="a"; z[w])

Rich in object types. See BSON specs

e.g.,

d = New Date()
ISODate("2012-10-11 7:14:12:23.788")
obj = {date: d}
  • Strings must be encodes UTF-8
  • ObjectId type can be used as identifier

Pymongo

Install with pip, .e.g,

 pip install https://github.com/mongodb/mongo-python-driver/archive/3.0b1.tar.gz 

Minimal example

import pymongo
from pymongo import MongoClient

connection = MongoClient('localhost', 27017)
db = connection.test
names = db.names
item = names.find_one()
print item['name']

load web data

Get some data

wget http://www.reddit.com/r/technology/json > reddit.json

cat reddit.json| python -m json.tool| more

In python, user urllib2, get_reddit_data.py

import json
import urllib2
import pymongo

connection = pymongo.MongoClient("mongodb://localhost")
db=connection.reddit
stories = db.stories
stories.drop()

reddit_page = urllib2.urlopen("http://www.reddit.com/r/technology")
parsed = json.loads(reddit_page.read())

for item in parsed['data']['children']:
	stories.insert_one(item['data'])

Regex

Finding regit post from Apple and Google
reddit_regex.py

import pymongo
connection = pymongo.MongoClient("mongodb://localhost")
db=connection.reddit
stories = db.stories
find();

def find():
	print "find, stories for apple and google"
	q = {'title':{'$regex': 'apple|google', '$options': 'i'}}
	p = {'title':1, '_id': 0}
	try
		cursor = stories.find(q, p)
	except Exception as e:
		print "Unexpected error: ", type(e), e
	for doc in cursor
		print doc

sort, skip and limit

Always happens in this order, regardless the composition order

SORT > SKIP > LIMIT

e.g. get 5th element from last (); same results

cursor = scores.find(query).sort('s_id', pymongo.ASCENDING).skip(4).limit(1)

cursor2 = scores.find(query).skip(4)
cursor2 = cursor2.limit(1)
cursor2 = cursor2.sort('s_id', pymongo.ASCENDING)

for doc in cursor
	print doc

for doc2 in cursor2
	print doc2

complex sorting

scores.sort([('s_id', pymongo.ASCENDING), ('score', pymongo.ASCENDING)])

Insert, update, deleting

pymongo vs server

Insert

  • insert_one vs. insert
  • insert_many vs. bulk

Update

  • update_one vs. update
  • update_many vs. update
  • replace_one vs. update

Delete

  • delete_one vs. remove
  • delete_many

insert_one

_id must be unique or autofilled

people.insert_one({'name':'claire'})
people.insert_one({'_id': 1,  'name':'mark'})

insert_many

the second argument ordered, makes insertion in strict order. If one fails, subsequent will fail

people.insert_many(collection_insert, ordered=True)	

insert_ordered default is True

update_one, update_many

update_one  + filter, updates($operator)
update_many + filter, updates($operator)

	res = scores.update_one({'id': my_id}, 
					{'$set': {'timestamp' : datetime.datetime.utcnow() }}	)

	print res.matched_count

implemented with the same underlying function

replace_one

updates entire document. Useful by:

  1. Pulling all documennt
  2. Making changes
  3. saving back to the document

set null all timestamps from all records

record  = scores.findOne({...})

// vulnerable block

id = record['_id'];
record['timestamp'] = null


// end block
scores.replace_one({'_id': id} record)

all updates and replace uses udpdate command under the hood

update_one or update_many, you must specify a $operator of some sort.

if you use replace_one, no $operator needed

upserts

if match
-> update
else
insert

things.update_one ( {..}, {..}, upsert=True)
things.update_many( {..}, {..}, upsert=True)
things.replace_one( {..}, {..}, upsert=True)

NOTE: for replace_one upsert only inserts second argument object, not filter

e.g

{_id:"bat", friend:'ball', cousin:'glove'} // equivalent

stuff.update_one({'_id':'bat'}, {'$set':{'friend':'ball', 'cousin':'glove'}}, upsert=True)

stuff.update_one({'_id':'bat'}, {'$set':{'_id':'bat', 'friend':'ball', 'cousin':'glove'}}, upsert=True)

stuff.replace_one({'_id':'bat'}, {'friend':'ball', 'cousin':'glove'}, upsert=True)

delete_one

res = scores.delete_one({..})
res.deleted_count
1

res = scores.delete_many({..})
res.deleted_count
3

find_one_and _***

Combining the document state before or after updates

// gets the document AFTER the update

counter = find_one_and_update(	filter={'type': 'uid'},
							 	update={'$inc': {'value': 1}},
							 	upsert=True, 
							 	return_document=pymongo.ReturnDocument.AFTER
							 	)

Actions are atomic, you can make changes and get the value in one operation

Use find_and_modify under the hood

Examples

Find Student with scores greater or equal to  65 in ascending order

db.grades.find({score: {$gte: 65}}).sort({score: 1 })

db.grades.aggregate(
	{'$group':{'_id':'$student_id', 'average':{$avg:'$score'}}}, 
	{'$sort':{'average':-1}}, 
	{'$limit':1})

Delete lowest homework score for each student

  1. filter homeworks
  2. sort student, then sort score
  3. delete when record student_id changes
db.grades.find({"type": "homework"})
		.sort({"student_id": 1, "score": 1})

using the zip codes dataset

Where each record looks like:

{
	"_id" : "35004",
	"city" : "ACMAR",
	"loc" : [
		-86.51557,
		33.584132
	],
	"pop" : 6055,
	"state" : "AL"
}

and some cities have multiple zip codes, i.e., city field is not unique

Sum population of all cities where the city name is a digit

db.zips.aggregate([
    {"$match": {"city": {$regex: '[0-9]+'}}},
   	{"$group": {"_id": null, "total": {$sum: "$pop"}}}
])

Or you could operate on the string using:

db.zips.aggregate([
    {"$project": 
     {
     	"city": "$city",
		"first_char": {"$substr" : ["$city",0,1]},
		"pop": "$pop"
     }	 
   }
])