repoze.catalog and ZODB beginners example – part 1

repoze.catalog and ZODB beginners example – part 1

Summary

The first of two posts which illustrate how to use repoze.catalog alongside ZODB

What’s ZODB ?

To quote the ZODB home page :

“The ZODB is a native object database, that stores your objects while allowing you to work with any paradigms that can be expressed in Python. Thereby your code becomes simpler, more robust and easier to understand”

What’s repoze.catalog ?

repoze.catalog is one of a number of frameworks which can be used to supply indexing for ZODB for those circumstances where accessing objects stored in ZODB would otherwise be unacceptably slow

Intended Audience

I’m assuming that readers of this post have a basic familiarity with ZODB. If you don’t there are lots of good resources out there of which ‘Example Driven ZODB‘ is a good example.

What’s the purpose of this post ?

For any reasonably experienced Python programmer using ZODB and repoze.catalog is pretty straightforward. Unfortunately a new user of repoze.catalog cannot find an example on the repoze.catalog site which shows both how to catalog items and save them into ZODB. This is understandable as repoze.catalog is not only for use with ZODB but I thought it was worthwhile doing a specific example for that scenario and that’s what on this page.

Where repoze.catalog helps ZODB

Because of the nature of ZODB it’s easy to access objects by the value they’re keyed on but otherwise it’s a question of a sequential search.

So instances of a class that look like this :

class Country(Persistent):
    def __init__(self, pop):
        self.name = name
        self.population = pop

Might be saved into the `root` property of a ZODB `connection` object like this :

root['un']['nz'] = Country('New Zealand', 4000000)

But subsequently if we wanted to obtain `Country` instances on the basis of population the key doesn’t help us at all and a scan of all `Country` objects would be necessary, like this :

for cou in root['un'].itervalues():
    if cou.population > 1000000:
        print cou.name

When we use repoze.catalog to catalogue a ZODB database we specify properties of objects that will be saved in ZODB and which interest us and by which will subsequently want to find the objects.

repoze.catalog allows us quickly and easily search for objects with property values that interest us.

Example Code

Here’s my example code and underneath I’ll expand a little more on what each part does:

'''
Demonstrates how to use repoze.catalog to catalogue objects
being stored in ZODB. This example has the catalog and ZODB
database as seperate repositories
'''
from repoze.catalog.catalog import FileStorageCatalogFactory
from repoze.catalog.catalog import ConnectionManager

from repoze.catalog.indexes.field import CatalogFieldIndex
from repoze.catalog.indexes.text import CatalogTextIndex

import transaction
from persistent import Persistent
from BTrees.OOBTree import OOBTree

from myzodb import MyZODB

factory = FileStorageCatalogFactory('../data/mdcatalog.db', 'mycatalog')

_initialized = False

def initialize_catalog():
    '''
    Create a repoze.catalog instance and specify
    indices of intereset

    NB: Use of global variable
    '''
    global _initialized
    if not _initialized:
        # create a catalog
        manager = ConnectionManager()
        catalog = factory(manager)
        # set up indexes
        catalog['names'] = CatalogTextIndex('name')
        catalog['populations'] = CatalogFieldIndex('population')
        # commit the indexes
        manager.commit()
        manager.close()
        _initialized = True

class City(Persistent):
    '''Represents a City by name and population'''
    def __init__(self, cityname, citypop):
        self.name = cityname
        self.population = citypop
    def __str__(self):
        return "%s  (Pop: %s)" % \
                (self.name, \
                str(self.population))

if __name__ == '__main__':
    initialize_catalog()
    manager = ConnectionManager()
    catalog = factory(manager)
    myzodbinstance = MyZODB('../data/mdzdb.fs')
    myzodbinstance.dbroot['cities'] = OOBTree()

    #For ease of demonstration set up a local dict
    #containing a number of `City` instances keyed
    #by a unique integer
    cities = {
        1:City('Windhoek', 322500),
        2:City('Pretoria', 525387),
        3:City('Nairobi', 3138295),
        4:City('Maputo', 1244227),
        5:City('Jakarta', 10187595),
        6:City('Canberra', 358222),
        7:City('Wellington', 393400),
        8:City('Santiago', 5428590),
        9:City('Buenos Aires', 2891082),
    }
    #Iterate over our local dict and for each
    #element generate the catlog entry for
    #repoze.catalog and add the corresponding
    #instance to the ZODB database we are
    #cataloguing
    for docid, doc in cities.items():
        catalog.index_doc(docid, doc)
        myzodbinstance.dbroot['cities'][docid] = doc
        transaction.commit()

    myzodbinstance.close()

Example Step by Step

Here’s a breakdown on what’s happening in the above example

Initialize repoze.catalog

initialize_catalog()
manager = ConnectionManager()
catalog = factory(manager)

The `initialize_catalog` function creates a repoze.catalog instance and initializes two indices : `names` and `populations`. These index the `name` and `population` properties of any objects indexed with the repoze.catalog instance just created

Make our ZODB database ready for use

myzodbinstance = MyZODB('../data/mdzdb.fs')

`MyZODB` is a convenience class which wraps up the instantiation of a ZODB database instance and provides : `storage`; `db`;`connection`; and `dbroot` properties to help the programmer interact with the ZODB database, connection, storage objects. `MyZODB` also provides a close method to cleanly close the ZODB database, connection and storage.

`MyZODB` is not explicitly included in the above example but it looks like this :

from ZODB import FileStorage, DB
class MyZODB(object):
    '''Manage the state of a ZODB FileStorage connection'''
    def __init__(self, path):
        self.storage = FileStorage.FileStorage(path)
        self.db = DB(self.storage)
        self.connection = self.db.open()
        self.dbroot = self.connection.root()
    def close(self):
        self.connection.close()
        self.db.close()
        self.storage.close()</pre>

Create a sub-tree in ZODB for our `City` objects

Now we have an instance of `MyZODB` we can treat the `dbroot` property (which corresponds to the ZODB `dbroot` property of the ZODB `connection` object) as a plain old dictionary and assign a value to it under some key of our choosing, for our example because we’re going to save a set of `City` objects we’ve chosen ‘cities’.

At this stage we just assign an instance of `OOBTree` to that ‘cities’ key. An OOBTree instance acts like a dictionary but, when a lot of elements are within it, works much more efficiently for the purposes of ZODB.

myzodbinstance.dbroot['cities'] = OOBTree()

Create a set of `City` objects

Now we pause for a moment and make ourselves a set of `City` objects and put them into a dictionary for later use.

What’s significant here is that the key used to save each `City` instance is a unique integer which has no specific meaning in itself, we’ll see why in a moment.

cities = {
1:City('Windhoek', 322500),
2:City('Pretoria', 525387),
3:City('Nairobi', 3138295),
4:City('Maputo', 1244227),
5:City('Jakarta', 10187595),
6:City('Canberra', 358222),
7:City('Wellington', 393400),
8:City('Santiago', 5428590),
9:City('Buenos Aires', 2891082),
}

Save `City` objects to ZODB and index them

Now at last we’re going to do what we’ve come for.

We iterate over our set of `City` instances and for each one we make use of the `index_doc` method of repoze.catalog . Notice that the two arguments are the integer we’ve arbitarily associated with each `City` instance, ‘docid’ in this example, and the `City` instance itself, ‘doc’ in this example. By using the `index_doc` method we update the catalog entries maintained by repoze.catalog

In the same interation we assign the `City` object instance, ‘doc’ to our `OOBTree` (stored under the ‘cities’ key of `dbroot`) using as an index the same integer we’ve just passed to the `index_doc` call.

Finally we make use of the ZODB Transaction manager to commit our changes. Because repoze.catalog is actually a ZODB database inside our single transaction is sufficient to commit both the catalog update and the actual update of the ZODB database.

for docid, doc in cities.items():
    catalog.index_doc(docid, doc)
    myzodbinstance.dbroot['cities'][docid] = doc
    transaction.commit()

Credit where credits due

The example I’ve shown here owes some parts to one of the examples on the repoze.catalog website. The structure of the `myZODB` was taken from the article cited above,  ‘Example Driven ZODB‘ . Lastly I got some useful advice in response to a question I posed on StackOverflow and I’m grateful to the people who provided answers .

In Closing

That’s all there is to it ! In many small scale instances there’s no need to do anything other than use ZODB as it comes and not worry about indexing – machines are fast and many applications deal with relatively small data sets however if you do need it repoze.catalog (or one of the other, similar, cataloguing tools) is a useful way to squeeze more speed out of ZODB.

This has been a very long blog post by my standards so I’m going to show how to access the data indexed under repoze.catalog (and prove that it all actually works !) in a blog post next week.

Leave a Reply

Your email address will not be published. Required fields are marked *