MongoDB
Intro
- MongoDB is document oriented.
- MongoDB is schemaless.
- No joins, no transactions (scales horizontally)
Unlike in a relational database, you are not constrained to follow any particular schema. To move to a new schema you can do any of the following:
- Begin inserting documents with the new schema.
- Perform a bulk update on the existing documents.
- Begin updating old documents to the new schema one by one at an appropriate event (such as when they are read or written), as coded in the application (see the sketch below).
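A minimal sketch of the lazy, one-by-one migration using pymongo (covered later in these notes); the people collection, the status field, and its default value are hypothetical, chosen only for illustration:
import pymongo

client = pymongo.MongoClient('mongodb://localhost')
people = client.test.people   # hypothetical collection

def get_person(name):
    doc = people.find_one({'name': name})
    # migrate this one document the first time it is read under the new schema
    if doc is not None and 'status' not in doc:
        people.update_one({'_id': doc['_id']}, {'$set': {'status': 'active'}})
        doc['status'] = 'active'
    return doc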
MongoDB does not support joins as a design decision, because joins do not scale horizontally; it also does not support SQL, since that query language is built around joins and transactions.
Building an App
Switch to a database (it is created lazily on first write)
use test
and save documents to a collection, which is likewise created on first use (save overwrites an existing document if an _id is given)
db.things.save({ a: 1, b : 1, fruit: ["apple", "orange", "pear"]})
db.mycol.save({_id: 1, title: "overview", url:'google.com', tags: [], likes:200 })
db.mycol.insert({title: "overview", url:'google.com', tags: ['uno', 'dos'], likes:100 })
delete records
db.mycol.remove({_id: 1})
filtering (&& is comma, || is $or)
db.mycol.find({"likes":100})
db.mycol.find({"likes":{$ne:101}, title: "overview"})
db.mycol.find({$or: [{"likes": {$gt: 50}}, {"_id": {$ne: null}}]})
db.mycol.find({"title": "overview", $or: [{"likes": {$gt:50}}, {"_id": {$ne: null}} ]})
other find functions
db.mycol.findOne()
Add anything to a collection as long as it is valid JSON (unquoted keys are OK in the shell)
db.things.find()
{ "_id" : ObjectId("5567d9251f523c3d2f2fc579"), "a" : 1, "b" : 2, "c" : 3 }
{ "_id" : ObjectId("5567d9571f523c3d2f2fc57a"), "a" : 3, "b" : 4, "c" : 3, "d" : 5 }
{ "_id" : ObjectId("5567dc45fd1f5b47c9886a05"), "a" : 1, "b" : 1, "fruit" : [ "apple", "orange", "pear" ] }
Better output
db.things.find().pretty()
{ "_id" : ObjectId("5567d9251f523c3d2f2fc579"), "a" : 1, "b" : 2, "c" : 3 }
{
"_id" : ObjectId("5567d9571f523c3d2f2fc57a"),
"a" : 3,
"b" : 4,
"c" : 3,
"d" : 5
}
{
"_id" : ObjectId("5567dc45fd1f5b47c9886a05"),
"a" : 1,
"b" : 1,
"fruit" : [
"apple",
"orange",
"pear"
]
}
{
"_id" : ObjectId("5567dd54bed3472b71fbf3fe"),
"name" : "andrew",
"address" : {
"street" : "elm drive",
"city" : "Palo Alto",
"zipcode" : "94126",
"houser_number" : 34
}
}
These are all valid JSON inputs
{ a : 1, b : 2, c : 3 }
{ a : 1, b : 2, c : [ 1, 2, 3, 4, 5 ] }
{ a : 1, b : { }, c : [ { a : 1, b : 2 }, 5, 6 ] }
{ }
Installing on Debian
Based on the official documentation at mongodb.org
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
echo "deb http://repo.mongodb.org/apt/debian "$(lsb_release -sc)"/mongodb-org/3.0 main" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list
if lsb_release is not pre-installed
sudo apt-get install lsb-release
Update the package list and install the latest stable release
sudo apt-get update
sudo apt-get install -y mongodb-org
Fire up mongod
sudo service mongod start
Troubleshooting
May need to remove lock data file
rm /var/lib/mongodb/mongod.lock
May need to set locales etc.
export LC_ALL=C
export LC_CTYPE="UTF-8"
export LANG="en_US.UTF-8"
About the 32-bit implementation
A note about the 32-bit implementation of MongoDB, if you're curious: It can safely handle up to around 2 GB of data, but if you do the math, 2^32 bytes = 4.3 gigabytes of data. Why the discrepancy? Well, the 4.3 GB is a good predictor of the largest total size of the data files you can get, but in fact, it is unlikely that a production implementation would be able to use even that much.
The 4.3 GB limit is true when we talk about the size of the data files, but often MongoDB doesn't use space 100% efficiently. It trades some space for some speed. As we add documents, we also (by default) add "padding", or empty space into which MongoDB can grow.
In MongoDB 3.0, a power-of-2 record allocation strategy is used by default. The record size is a power of 2 (32 bytes, 64 bytes, 128 bytes, etc.), and any document created fills the smallest such record that can contain it. This makes for fewer document moves and a good likelihood that a record will fit the space vacated by another record. Overall database performance improves, but at a cost in disk space. Plus there is other metadata to consider, such as indexes and the operations log. Add it all together, and it's very likely that, somewhere above 2 gigabytes but below 4 GB, a 32-bit implementation of MongoDB will run out of space. This is why it shouldn't be used in production.
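Purely as an illustration of that allocation arithmetic (not MongoDB's actual allocator), the record chosen for a document is the smallest power of two, starting at 32 bytes, that can hold it:
def allocation_size(doc_bytes):
    # smallest power-of-2 record (at least 32 bytes) that fits the document
    size = 32
    while size < doc_bytes:
        size *= 2
    return size

# a 200-byte document lands in a 256-byte record, leaving 56 bytes of padding
print(allocation_size(200))   # 256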
CRUD operations
insert, find, update, delete:
db.people.insert(john)
db.people.insert(maria)
db.people.find()
{"_id": ObjectId("2883..."), "name": "john", "age": 10}
{"_id": ObjectId("2312..."), "name": "maria", "age": 10}
db.people.findOne({"age": 10})
{"_id": ObjectId("2883..."), "name": "john", "age": 10}
db.people.findOne({"age": 10}, {"_id": false, "name": true, "age": false})
{"name": "john"}
Querying
multiple fields in find are combined with AND (&&)
db.scores.find({"_id": 19, "type": "essay"})
using $gt and $lt, $gte, $lte
db.scores.find({"scores": { $gt: 95, $lte: 98},"type": "essay"})
inequalities on strings are resolved lexicographically
db.scores.find({"name": { $gte: "G"}}
Gael
Gabriel
...
operators exists, types, and regex
db.people.find({profession: {$exists : true}})
db.people.find({name: {$type : 2}}) //2 = strings, see BSON docs
db.people.find({name: {$regex: "^A.*e$"}}) //starts with A and ends with e (a duplicate $regex key would just be overwritten)
operator $or
db.people.find({ $or: [{name: "john"}, {age : 10}]})
operator $and (infrequent)
db.people.find({ $and: [{name: "john"}, {age : 10}]})
Note: in the example below the second score condition overwrites the first (duplicate key), so this is NOT an AND of the two conditions; put both operators in one document, as shown above with $gt/$lte, to get a range
db.scores.find({score: {$gt: 50}, score: {$lt: 60}})
querying inside arrays
db.people.find({classes: "maths"})
{"_id": ObjectId("2312..."), "name": "maria", "age": 10, "classes": ["maths", "physEd"]}
Note: matching does not recurse into nested documents; a query like this only matches top-level fields (including elements of top-level arrays)
using $in and $all (arrays)
// the array must contain all of them
db.people.find({classes: {$all: ["maths", "physEd"]}})
{"_id": ObjectId("2312..."), "name": "maria", "age": 10, "classes": ["maths", "physEd"]}
{"_id": ObjectId("2312..."), "name": "mario", "age": 12, "classes": ["physEd","maths", "art"]}
// the array must contain at least one of them
db.people.find({classes: {$in: ["art", "maths"]}})
{"_id": ObjectId("2312..."), "name": "maria", "age": 10, "classes": ["maths", "physEd"]}
{"_id": ObjectId("2312..."), "name": "mario", "age": 12, "classes": ["physEd","maths", "art"]}
dot notation
// whole-subdocument equality: the embedded document must match exactly, including field order
db.people.find({"phone": {
"work": "12355606451",
"home": "12878783556"
}});
{
ObjectId("2312..."),
"name": "pete",
"phone": {
"work": "12355606451",
"home": "12878783556"
}
}
// look inside nested documents using dot notation
db.people.find({"phone.work": "12355606451"});
querying, cursors
// hold that cursor
cur = db.people.find(); null;
// methods
//check for end of cursor
cur.hasNext()
//step through
cur.next()
//limit the number of documents the cursor returns
cur.limit(5); null
//sort
cur.sort({name: -1}); null
//chaining
cur.sort({name: -1}).limit(5); null
limit/sort/skip are actually applied on the server when the cursor is first used; the trailing null is just a shell trick to suppress printing the cursor
//skip the first 2 elements
cur.sort({name: -1}).limit(5).skip(2); null
counting
db.people.count({age: 10})
Updating
A single element update
{"_id": ObjectId("2312..."), "name": "mario", "age": 12, "classes": ["physEd","maths", "art"]}
db.people.update({name: "mario"}, {age: 20})
{"_id": ObjectId("2312..."), "name": "mario", "age": 20}
IMPORTANT: update replaces the whole record, fields not mapped will be lost!
using $set, $inc
db.people.update({name: "mario"}, {$set: {age: 20}})
{"_id": ObjectId("2312..."), "name": "mario", "age": 20, "classes": ["physEd","maths", "art"]}
db.people.update({name: "mario"}, {$inc: {age: 5}})
{"_id": ObjectId("2312..."), "name": "mario", "age": 25, "classes": ["physEd","maths", "art"]}
If the field doesn't exist, it'll be created
using $unset
db.people.update({name: "mario"}, {$unset: {classes: 1}})
{"_id": ObjectId("2312..."), "name": "mario", "age": 20}
the value 1 is arbitrary and is ignored
Array manipulation with $push, $pop, $pull, $pushAll, $pullAll, $addToSet
db.arrays.insert({_id: 10, a: [1,2,3,4]})
db.arrays.update({_id: 10}, {$set: {"a.2": 33}})
db.arrays.update({_id: 10}, {$push: {"a": 55}})
//removes the right-most element
db.arrays.update({_id: 10}, {$pop: {"a": 1}})
//removes the left-most element
db.arrays.update({_id: 10}, {$pop: {"a": -1}})
db.arrays.update({_id: 10}, {$pushAll: {"a": [55, 66, 77]}})
//removes the value, regardless of its position
db.arrays.update({_id: 10}, {$pull: {"a": 5}})
//removes every listed value
db.arrays.update({_id: 10}, {$pullAll: {"a": [2, 5, 6]}})
// treat the array as a set; repeating the call has no effect
db.arrays.update({_id: 10}, {$addToSet: {"a": 5}})
Upserts
Updates an existing record if one matches the filter, otherwise creates it (use only with concrete values)
db.people.update({_id: 10}, {$set : {age: 10}}, {upsert: true})
Multiupdate
Must pass the multi option; an empty filter {} matches all documents
the default behaviour is to update only ONE document in the collection
db.people.update({}, { $set: {surname: "doe"}}, {multi: true})
Each multi-update runs on a single thread but may yield to other write operations part-way through.
e.g. Update all records with score less than 70 and add 20 points
db.scores.update({'score': {$lt: 70}}, {$inc: {'score': 20}}, {multi: true})
Removing
db.people.remove({name: "alex"})
remove all
db.people.remove({})
// drop is more efficient and also removes the collection's metadata (e.g., indexes)
db.people.drop();
Mongo Shell
Essentially, a javascript runtime
- evaluates javascript; well-documented and easy to pick up
Autocompletion
Emacs key bindings: see help keys
A javascript object z = {a: 1}
can be accessed as z.a == z["a"] == (w = "a", z[w])
Rich in object types. See BSON specs
e.g.,
d = new Date()
ISODate("2012-10-11T07:14:12.788Z")
obj = {date: d}
- Strings must be encoded as UTF-8
- ObjectId type can be used as identifier
Pymongo
Install with pip, e.g.,
pip install https://github.com/mongodb/mongo-python-driver/archive/3.0b1.tar.gz
Minimal example
import pymongo
from pymongo import MongoClient
connection = MongoClient('localhost', 27017)
db = connection.test
names = db.names
item = names.find_one()
print item['name']
load web data
Get some data
wget -qO reddit.json http://www.reddit.com/r/technology/.json
cat reddit.json | python -m json.tool | more
In Python, use urllib2 (get_reddit_data.py):
import json
import urllib2
import pymongo
connection = pymongo.MongoClient("mongodb://localhost")
db = connection.reddit
stories = db.stories
stories.drop()
# fetch the JSON rendering of the subreddit front page
reddit_page = urllib2.urlopen("http://www.reddit.com/r/technology/.json")
parsed = json.loads(reddit_page.read())
# each child is one story; insert its data sub-document
for item in parsed['data']['children']:
    stories.insert_one(item['data'])
Regex
Finding reddit posts about Apple and Google
reddit_regex.py
import pymongo
connection = pymongo.MongoClient("mongodb://localhost")
db=connection.reddit
stories = db.stories
def find():
    print "find, stories for apple and google"
    # case-insensitive regex on the title; project only the title
    q = {'title': {'$regex': 'apple|google', '$options': 'i'}}
    p = {'title': 1, '_id': 0}
    try:
        cursor = stories.find(q, p)
    except Exception as e:
        print "Unexpected error:", type(e), e
        return
    for doc in cursor:
        print doc

find()
sort, skip and limit
These always happen in this order, regardless of the order in which they are chained
SORT > SKIP > LIMIT
e.g. get the 5th document by s_id; both cursors below return the same results
cursor = scores.find(query).sort('s_id', pymongo.ASCENDING).skip(4).limit(1)
cursor2 = scores.find(query).skip(4)
cursor2 = cursor2.limit(1)
cursor2 = cursor2.sort('s_id', pymongo.ASCENDING)
for doc in cursor:
    print doc
for doc2 in cursor2:
    print doc2
complex sorting
cursor = scores.find(query).sort([('s_id', pymongo.ASCENDING), ('score', pymongo.ASCENDING)])
Insert, update, deleting
pymongo vs server
Insert
- insert_one vs. insert
- insert_many vs. bulk
Update
- update_one vs. update
- update_many vs. update
- replace_one vs. update
Delete
- delete_one vs. remove
- delete_many
insert_one
_id must be unique; if you omit it, the driver fills it in with an ObjectId
people.insert_one({'name':'claire'})
people.insert_one({'_id': 1, 'name':'mark'})
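As a sketch of what the uniqueness constraint does (collection and documents here are assumptions for illustration), inserting the same _id twice raises a DuplicateKeyError:
import pymongo
from pymongo.errors import DuplicateKeyError

client = pymongo.MongoClient('mongodb://localhost')
people = client.test.people
people.delete_one({'_id': 100})   # start clean for the example

people.insert_one({'_id': 100, 'name': 'mark'})
try:
    # the unique index on _id rejects a second document with the same _id
    people.insert_one({'_id': 100, 'name': 'mark again'})
except DuplicateKeyError as e:
    print(e)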
insert_many
the keyword argument ordered makes the insertion happen in strict order; with ordered=True, if one insert fails the remaining documents are not attempted
people.insert_many(collection_insert, ordered=True)
ordered defaults to True
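A sketch of the difference (documents and collection are assumptions): with ordered=False pymongo attempts every document and reports the failures in a BulkWriteError instead of stopping at the first failure:
import pymongo
from pymongo.errors import BulkWriteError

client = pymongo.MongoClient('mongodb://localhost')
people = client.test.people
people.drop()

docs = [{'_id': 1, 'name': 'claire'},
        {'_id': 1, 'name': 'mark'},     # duplicate _id: this insert fails
        {'_id': 2, 'name': 'andrew'}]

try:
    # ordered=False keeps going past the failed document
    people.insert_many(docs, ordered=False)
except BulkWriteError as e:
    print(e.details['writeErrors'][0]['errmsg'])

# two documents remain: the duplicate was skipped, the rest were inserted
for doc in people.find():
    print(doc)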
update_one, update_many
update_one + filter, updates($operator)
update_many + filter, updates($operator)
res = scores.update_one({'id': my_id},
{'$set': {'timestamp' : datetime.datetime.utcnow() }} )
print res.matched_count
both are implemented with the same underlying update command
replace_one
replaces the entire document. A typical pattern:
- pull the whole document
- make the changes
- save it back to the collection
e.g. set a record's timestamp to null (loop over a find() cursor to do this for every record)
record = scores.find_one({...})
# read-modify-write: not atomic; another client could update the
# document between the find_one and the replace_one
record_id = record['_id']
record['timestamp'] = None
scores.replace_one({'_id': record_id}, record)
All of the update and replace helpers use the update command under the hood.
With update_one or update_many, you must specify a $operator of some sort.
With replace_one, no $operator is needed (you pass a full replacement document).
upserts
if match
-> update
else
insert
things.update_one ( {..}, {..}, upsert=True)
things.update_many( {..}, {..}, upsert=True)
things.replace_one( {..}, {..}, upsert=True)
NOTE: for a replace_one upsert, the inserted document is just the replacement (second argument); filter fields are not merged in, with the exception of _id
e.g.
// all three calls below upsert the equivalent document
{_id: "bat", friend: 'ball', cousin: 'glove'}
stuff.update_one({'_id':'bat'}, {'$set':{'friend':'ball', 'cousin':'glove'}}, upsert=True)
stuff.update_one({'_id':'bat'}, {'$set':{'_id':'bat', 'friend':'ball', 'cousin':'glove'}}, upsert=True)
stuff.replace_one({'_id':'bat'}, {'friend':'ball', 'cousin':'glove'}, upsert=True)
delete_one
res = scores.delete_one({..})
res.deleted_count
1
res = scores.delete_many({..})
res.deleted_count
3
find_one_and_*
Returns the document state from either before or after the update
// gets the document AFTER the update
counter = find_one_and_update( filter={'type': 'uid'},
update={'$inc': {'value': 1}},
upsert=True,
return_document=pymongo.ReturnDocument.AFTER
)
These operations are atomic: you can make a change and get the document's value in one operation
Uses findAndModify under the hood
Examples
Find students with a score greater than or equal to 65, sorted by score in ascending order
db.grades.find({score: {$gte: 65}}).sort({score: 1 })
Find the student with the highest average score
db.grades.aggregate([
    {'$group': {'_id': '$student_id', 'average': {'$avg': '$score'}}},
    {'$sort': {'average': -1}},
    {'$limit': 1}])
Delete lowest homework score for each student
- filter for homework scores
- sort by student_id, then by score
- delete the first record each time the student_id changes (that is the student's lowest score); see the sketch after the query below
db.grades.find({"type": "homework"})
.sort({"student_id": 1, "score": 1})
using the zip codes dataset
Where each record looks like:
{
"_id" : "35004",
"city" : "ACMAR",
"loc" : [
-86.51557,
33.584132
],
"pop" : 6055,
"state" : "AL"
}
and some cities have multiple zip codes, i.e., city field is not unique
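Because of that, per-city totals have to be summed across all of a city's zip-code documents; a sketch in pymongo (assuming the zips collection is in the test database) that lists the five most populous (state, city) pairs:
import pymongo

client = pymongo.MongoClient('mongodb://localhost')
zips = client.test.zips

# add up the population of every zip code belonging to the same (state, city) pair
pipeline = [
    {'$group': {'_id': {'state': '$state', 'city': '$city'},
                'population': {'$sum': '$pop'}}},
    {'$sort': {'population': -1}},
    {'$limit': 5},
]
for doc in zips.aggregate(pipeline):
    print(doc)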
Sum population of all cities where the city name is a digit
db.zips.aggregate([
{"$match": {"city": {$regex: '[0-9]+'}}},
{"$group": {"_id": null, "total": {$sum: "$pop"}}}
])
Or you could operate on the string using:
db.zips.aggregate([
{"$project":
{
"city": "$city",
"first_char": {"$substr" : ["$city",0,1]},
"pop": "$pop"
}
}
])
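One possible way to finish that alternative (a sketch, not from the original notes): match on the projected first_char and then sum, shown here in pymongo with the same test-database assumption as above:
import pymongo

client = pymongo.MongoClient('mongodb://localhost')
zips = client.test.zips

pipeline = [
    {'$project': {'city': '$city',
                  'first_char': {'$substr': ['$city', 0, 1]},
                  'pop': '$pop'}},
    # keep only cities whose name starts with a digit
    {'$match': {'first_char': {'$in': [str(d) for d in range(10)]}}},
    {'$group': {'_id': None, 'total': {'$sum': '$pop'}}},
]
print(list(zips.aggregate(pipeline)))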