The Database Backend
Previously, in order to use Django's ORM with the App Engine Datastore, django-nonrel was required, along with djangoappengine. That's now changed: with Djangae you can use vanilla Django with the Datastore. Heavily inspired by djangoappengine (thanks Waldemar!), Djangae provides an intelligent database backend that allows vanilla Django to be used, and makes use of many of the Datastore's speed and efficiency features, such as projection queries.
Here's the full list of magic:

- Database-level enforcement of `unique` and `unique_together` constraints.
- A transparent caching layer for queries which return a single result (`.get` or any query filtering on a unique field or unique-together fields). This helps to avoid Datastore consistency issues.
- Automatic creation of additional index fields containing pre-manipulated values, so that queries such as `__iexact` work out of the box. These index fields are created automatically when you use the queries. Use `settings.GENERATE_SPECIAL_INDEXES_DURING_TESTING` to control whether that automatic creation happens during tests.
- Support for queries which weren't possible with djangoappengine, such as OR queries using `Q` objects.
- A collection of Django model fields which provide useful functionality when using the Datastore: `ListField`, `SetField`, `RelatedSetField`, `ShardedCounterField` and `JSONField`. See the Djangae Model Fields documentation for full details.
Roadmap
1.0-beta
- Support for ancestor queries
- Lots of tests
What Can't It Do?
Due to the limitations of the App Engine Datastore (it being a non-relational database for a start), there are some things which you still can't do with the Django ORM when using the Djangae backend. The easiest way to find these out is to just build your app and look out for the `NotSupportedError` exceptions.
Notable Limitations
Here is a brief list of hard limitations you may encounter when using the Djangae datastore backend:
- `bulk_create()` is limited to 25 instances if the model has active unique constraints, or the instances being inserted have the primary key specified. In other cases the limit is 1000.
- `filter(field__in=[...])` queries are limited to 100 entries (by default) in the list if `field` is not the primary key.
- `filter(pk__in=[...])` queries are limited to 1000 entries.
- You are limited to a single inequality filter per query, although excluding by primary key is not included in this count.
- Queries without primary key equality filters are not allowed within an `atomic` block.
- Queries with an inequality filter on a field must be ordered first by that field.
- Only 25 individual instances can be retrieved or saved within an `atomic` block, although you can get/save the same entity multiple times without increasing the allowed count.
- Primary key values of zero are not allowed.
- Primary key string values must not start with a leading double underscore (`__`).
- `ManyToManyField` will not work reliably or efficiently - use `RelatedSetField` or `RelatedListField` instead.
- Transactions. The Datastore has transactions, but they are not "normal" transactions in the SQL sense. Transactions should be done using `djangae.db.transaction.atomic`.
- If unique constraints are enabled, then you are limited to a maximum of 25 unique or unique_together constraints per model (see Unique Constraint Checking).
- You are also restricted to altering 12 unique field values on an instance in a single save.
- `select_related` does nothing. It is ignored when specified, as joins are not possible on the Datastore. This can result in slow performance on queries which are not designed for the Datastore. `prefetch_related` works correctly, however.
There are probably more but the list changes regularly as we improve the datastore backend. If you find another limitation not mentioned above please consider sending a documentation PR.
Other Considerations
When using the Datastore you should bear in mind its capabilities and limitations. While Djangae allows you to run Django on the Datastore, it doesn't turn the Datastore into a relational database. There are things which the datastore is good at (e.g. handling huge bandwidth of reads and writes) and things which it isn't good at (e.g. counting). Djangae is not a substitute for knowing how to use the Datastore.
Using Other Databases
You can use Google Cloud SQL or SQLite (locally) instead of, or alongside, the Datastore.
Note that the database backend and settings for the Datastore remain the same whether you're in local development or on App Engine production; Djangae switches between the SDK and the production Datastore appropriately. However, with Cloud SQL you will need to switch the settings yourself, otherwise you could find yourself developing on your live database!
Here's an example of how your `DATABASES` setting might look in `settings.py` if you're using both Cloud SQL and the Datastore:
```python
from djangae.environment import is_development_environment

DATABASES = {
    'default': {
        'ENGINE': 'djangae.db.backends.appengine'
    }
}

if not is_development_environment():
    DATABASES['sql'] = {
        'ENGINE': 'django.db.backends.mysql',
        'HOST': '/cloudsql/YOUR_GOOGLE_CLOUD_PROJECT:YOUR_INSTANCE_NAME',
        'NAME': 'YOUR_DATABASE_NAME',
        'USER': 'root',
    }
else:
    DATABASES['sql'] = {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'development.sqlite3',
    }
```
See the Google documentation for more information on connecting to Cloud SQL via the MySQL client and from external applications.
Datastore Caching
Djangae has a built-in caching layer, similar to the one built into NDB - only better! You shouldn't even notice the caching layer at work; it's fairly complex, and to understand the behaviour you are best reading through the caching tests. But here's a general overview:
- There are two layers of caching, the context cache and the memcache cache
- When possible, if you get/save an entity it will be cached by its primary key value and its unique constraint combinations
- This protects against HRD inconsistencies in many situations, and it happens automagically
- The caching layer is heavily tied into the `transaction.atomic` decorator. If you use the `db.RunInTransaction` stuff you are going to have a hard time, so don't do that!
- You can disable the caching by using the `disable_cache` context manager/decorator. `disable_cache` takes two boolean parameters, `context` and `memcache`, so you can configure which caches you want disabled. Be careful though: don't toggle the caching on and off too much or you might get into trouble (I'm sure there's a situation where you can break it, but I haven't figured out what it is)
- The context cache has a complex stack structure: when you enter a transaction the stack is pushed, and when you leave a transaction it's popped. This ensures the cache gives you the right results at the right time
- The context cache is cleared on each request, and it's thread-local
- The memcache cache is not cleared; it's global across all instances, and so is updated only when a consistent Get/Put outside a transaction is made
- Entities are evicted from memcache if they are updated inside a transaction (to prevent craziness)
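The stack behaviour of the context cache can be sketched with a toy pure-Python cache. This is an illustration of the idea only; Djangae's real context cache is far more sophisticated:

```python
# Toy sketch of a stacked context cache (illustration only, not Djangae's
# implementation). Entering a transaction pushes a fresh layer; leaving it
# pops the layer, so reads inside a transaction see that transaction's own
# writes first, then fall back to outer layers.

class ContextCache:
    def __init__(self):
        self.stack = [{}]  # the base layer exists for the whole request

    def push(self):
        # Called when entering a transaction.
        self.stack.append({})

    def pop(self):
        # Called when leaving a transaction.
        self.stack.pop()

    def set(self, key, value):
        self.stack[-1][key] = value

    def get(self, key):
        # Search from the innermost transaction layer outwards.
        for layer in reversed(self.stack):
            if key in layer:
                return layer[key]
        return None

cache = ContextCache()
cache.set(1, "committed value")
cache.push()                   # begin transaction
cache.set(1, "in-txn value")
print(cache.get(1))            # in-txn value
cache.pop()                    # transaction finished
print(cache.get(1))            # committed value
```

This layering is why the right results come back "at the right time": the innermost layer always wins while its transaction is active.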
The following settings are available to control the caching:
- `DJANGAE_CACHE_ENABLED` (default `True`). Setting this to `False` turns the caching off entirely; I really wouldn't suggest doing that!
- `DJANGAE_CACHE_TIMEOUT_SECONDS` (default `60 * 60`). The length of time entities should be kept in memcache.
- `DJANGAE_CACHE_MAX_CONTEXT_SIZE` (default `1024 * 1024 * 8`). This is (approximately) the max size of a local context cache instance. Each request and each nested transaction gets its own context cache instance, so be aware that this total can rapidly add up, especially on F1 instances. If you have an F2 or F4 instance you might want to increase this value. If you hit the limit, the least used entities will be evicted from the cache.
- `DJANGAE_CACHE_MAX_ENTITY_COUNT` (default `8`). This is the max number of entities returned by a `pk__in` query which will be cached in the context or memcache upon their return. If more results than this number are returned then the remainder won't be cached.
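As a `settings.py` fragment, the defaults described above would look like this (the values are those stated in this document; adjust to taste):

```python
# settings.py - Djangae caching settings, shown here with their defaults.
DJANGAE_CACHE_ENABLED = True                     # turn the whole layer on/off
DJANGAE_CACHE_TIMEOUT_SECONDS = 60 * 60          # one hour in memcache
DJANGAE_CACHE_MAX_CONTEXT_SIZE = 1024 * 1024 * 8 # ~8MB per context cache
DJANGAE_CACHE_MAX_ENTITY_COUNT = 8               # max pk__in results cached
```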
Datastore Behaviours
The Djangae database backend for the Datastore contains some clever optimisations and integrity checks to make working with the Datastore easier. This means that in some cases there are behaviours which are either not the same as the Django-on-SQL behaviour or not the same as the default Datastore behaviour. So for clarity, below is a list of statements which are true:
General Behaviours
- Doing `MyModel.objects.create(primary_key_field=value)` will do an insert, so will explicitly check that an object with that PK doesn't already exist before inserting, and will raise an `IntegrityError` if it does. This is done in a transaction, so there is no need for any kind of manual transaction or existence checking.
- Doing an `update(field=new_value)` query is transactionally safe (i.e. it uses transactions to ensure that only the specified field is updated), and it also automatically avoids the Stale objects issue (see Eventual Consistency below) so it only updates objects which definitely match the query. It may still suffer from the Missing objects issue, though. See notes about the speed of `update()` queries in the Speed section below.
- A `.distinct()` query is only possible if the query can be done as a projection query (see the Speed section below).
Eventual Consistency
See App Engine documentation for background.
The Datastore's eventual consistency behaviour gives us 2 main issues:
- Stale objects: This is where querying for objects by a non-primary key field may return objects which no longer match the query (because they were recently modified) or which were recently deleted.
- Missing objects: This is where querying for objects by a non-primary key may not return recently created or recently modified objects which do match the query.
There are various solutions and workarounds for these issues.
- If `pk__in` is used in the query (with or without other filters) then the query will be consistent: it will return all matching objects and will not return any non-matching objects.
- Accessing the queryset of a `RelatedSetField` or `RelatedListField` automatically gives you the consistency of a `pk__in` filter (because that's exactly what it's doing underneath). So `my_obj.my_related_set_field.all()` is consistent.
- To avoid the Stale objects issue, you can do an initial `values_list('pk')` query and pass the result to a second query, e.g. `MyModel.objects.filter(size='large', pk__in=list(MyModel.objects.filter(size='large').values_list('pk', flat=True)))`. Notes:
    - This causes 2 queries, so is slightly slower, although the nested `values_list('pk')` query is fast as it uses a Datastore keys-only query.
    - You need to cast the nested PKs query to a list, as otherwise Django will try to combine the inner query as a subquery, which the Datastore cannot handle.
    - You need to include the additional filters (in this case `size='large'`) in both the inner and outer queries.
    - This technique only avoids the Stale objects issue; it does not avoid the Missing objects issue.
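The PK re-fetch workaround can be modelled in plain Python. In this toy sketch (an illustration only, not Djangae's code), the "index" lags behind the entities, and the second, key-based pass filters out the stale match:

```python
# Toy model of the "Stale objects" workaround. The eventually consistent
# index can still reference entities whose current data no longer matches
# the filter, so the candidate PKs are re-checked against the strongly
# consistent, key-addressed entities.

entities = {                  # source of truth, keyed by PK (consistent)
    1: {"size": "large"},
    2: {"size": "small"},     # recently changed from "large"...
}
stale_index = {"large": [1, 2]}  # ...but the index hasn't caught up yet

def filter_consistently(size):
    candidate_pks = stale_index[size]   # eventually consistent query
    # Second, key-based pass drops entities that no longer match.
    return [pk for pk in candidate_pks if entities[pk]["size"] == size]

print(filter_consistently("large"))  # [1]
```

Note that, as the docs say above, this only removes stale results; an entity whose index entry hasn't been written yet (the Missing objects issue) still won't appear.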
djangae.db.consistency.ensure_instance_consistent
It's very common to need to create a new object and then redirect to a listing of all objects. This annoyingly falls foul of the Datastore's eventual consistency. As a `.all()` query is eventually consistent, it's quite likely that the object you just created or updated either won't be returned or, if it was an update, will show stale data. You can fix this by using djangae.contrib.consistency, or if you want a more lightweight approach you can use `djangae.db.consistency.ensure_instance_consistent` like this:

```python
queryset = ensure_instance_consistent(MyModel.objects.all(), updated_instance_pk)
```
Be aware, though, that this will make an additional query for the extra object (although it's very likely to hit the cache). There are also caveats:
- If no ordering is specified, the instance will be returned first
- Only ordering on the queryset is respected; if you are relying on model ordering, the instance may be returned in the wrong place (patches welcome!)
- This causes an extra iteration over the returned queryset once it's retrieved
There is also an equivalent function for ensuring the consistency of multiple items, called `ensure_instances_included`.
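Conceptually, `ensure_instance_consistent` merges a known-fresh instance into possibly stale query results. Here is a toy pure-Python sketch of that idea (an illustration only, not Djangae's implementation; the dict-based "instances" are made up for the demo):

```python
# Sketch: merge a known-fresh instance into eventually consistent results.
# If the instance is missing from the results, prepend it (matching the
# "returned first if no ordering" caveat above); if a stale copy is
# present, replace it with the fresh one.

def ensure_included(query_results, fresh_instance, key=lambda o: o["pk"]):
    pks = {key(o) for o in query_results}
    if key(fresh_instance) not in pks:
        # Not yet visible to the query: return it first.
        return [fresh_instance] + list(query_results)
    # Visible but possibly stale: swap in the fresh copy.
    return [fresh_instance if key(o) == key(fresh_instance) else o
            for o in query_results]

stale_results = [{"pk": 1, "name": "a"}]   # pk=2 just saved, not indexed yet
fresh = {"pk": 2, "name": "b"}
print(ensure_included(stale_results, fresh))
# [{'pk': 2, 'name': 'b'}, {'pk': 1, 'name': 'a'}]
```

This also makes the listed caveats concrete: the merge happens after retrieval (the extra iteration), and ordering has to come from the queryset itself.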
Speed
- Using a `pk__in` filter in addition to other filters will usually make the query faster. This is because Djangae uses the PKs to do a Datastore `Get` operation (which is much faster than a Datastore `Query`) and then does the other filtering in Python.
- Doing `.values('pk')` or `.values_list('pk')` will make a query significantly faster, because Djangae performs a keys-only query.
- Doing `.values('other')` type queries will be faster if Djangae is able to perform a Datastore projection query. This is only possible if:
    - None of the fetched fields are also being filtered on (which would be a weird thing to do anyway).
    - The query is not ordered by primary key.
    - All of the fetched fields are indexed by the Datastore (i.e. are not list/set fields, blob fields or text (as opposed to char) fields).
    - The model does not have concrete parents.
- Doing an `.only('foo')` or `.defer('bar')` with a `pk__in=[...]` filter may not be more efficient. This is because we must perform a projection query for each key, and although we send them over the RPC in batches of 30, the RPC costs may outweigh the savings of a plain old `datastore.Get`. You should profile and check whether using only/defer results in a speed improvement for your use case.
- Due to the way it has to be implemented on the Datastore, an `update()` query is not particularly fast, and other than avoiding calling the `save()` method on each object it doesn't offer much speed advantage over iterating over the objects and modifying them. However, it does offer significant integrity advantages; see the General Behaviours section above.
- Doing `filter(pk__in=Something.objects.values_list('pk', flat=True))` will implicitly evaluate the inner query while preparing to run the outer one. This means two queries, not one as SQL would do!
- IN queries and queries with OR branches which aren't filtered on PK result in multiple queries to the Datastore. By default you will get an error if you exceed 100 IN filters, but this is configurable via the `DJANGAE_MAX_QUERY_BRANCHES` setting. Be aware that the more IN/OR filters in a query, the slower the query becomes. 100 is already a high value for this setting, so raising it isn't recommended (it's probably better to rethink your data structure or querying).
Unique Constraint Checking
IMPORTANT: Make sure you read and understand this section before configuring your project.
tl;dr Constraint checking is costly, so you might want to disable it globally using `settings.DJANGAE_DISABLE_CONSTRAINT_CHECKS` and re-enable it on a per-model basis.
Djangae by default enforces the unique constraints that you define on your models. It does so by creating so called "unique markers" in the datastore. Unique constraint checks have the following caveats...
- Unique constraints drastically increase your Datastore writes. Djangae needs to create a marker for each unique constraint on each model, for each instance. This means if you have one unique field on your model, then on save() Djangae must do two Datastore writes (one for the entity, one for the marker)
- Unique constraints increase your Datastore reads. Each time you save an object, Djangae needs to check for the existence of unique markers.
- Unique constraints slow down your save() calls. See above; each time you save, a bunch of things need to happen.
- Updating instances via the Datastore API (NDB, DB, or datastore.Put and friends) will break your unique constraints. Don't do that!
- Updating instances via the Datastore admin will do the same thing; you'll be bypassing the unique marker creation.
- There is a limit of 25 unique or unique_together constraints per model.
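Unique markers can be pictured as small sentinel entities keyed on the constrained values. The sketch below is a pure-Python illustration of the idea only; the real marker kind, key format and transactional machinery are Djangae implementation details (the `md5` key here is made up for the demo):

```python
import hashlib

# Illustration of the unique-marker idea: for each unique constraint, a
# separate marker is keyed on a digest of (model, field, value). Creating
# the marker fails if one already exists, which is what enforces
# uniqueness at the database level.

class IntegrityError(Exception):
    pass

def marker_key(model, field, value):
    raw = f"{model}|{field}:{value}".encode()
    return hashlib.md5(raw).hexdigest()

def acquire_marker(markers, model, field, value):
    key = marker_key(model, field, value)
    if key in markers:  # another instance already owns this value
        raise IntegrityError(f"{field}={value!r} already exists")
    markers.add(key)    # this is the "extra write" per unique constraint

markers = set()
acquire_marker(markers, "User", "email", "a@example.com")
try:
    acquire_marker(markers, "User", "email", "a@example.com")
except IntegrityError as e:
    print(e)  # email='a@example.com' already exists
```

The extra `acquire_marker` step is exactly why each save costs an additional read (the existence check) and an additional write (the marker), and why bypassing Djangae when writing entities leaves the markers out of sync.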
However, unique markers are very powerful when you need to enforce uniqueness. They are enabled by default simply because that's the behaviour that Django expects. If you don't want to use this functionality, you have the following options:
- Don't mark fields as unique, or use unique_together in the Meta class - this only works for your models; contrib models will still use unique markers
- Disable unique constraints on a per-model basis via the Djangae meta class (again, this only works on the model you specify):

      class Djangae:
          disable_constraint_checks = True

- Disable constraint checking globally via `settings.DJANGAE_DISABLE_CONSTRAINT_CHECKS`

The `disable_constraint_checks` per-model setting overrides the global `DJANGAE_DISABLE_CONSTRAINT_CHECKS`, so if you are concerned about speed/cost then you might want to disable globally and override on a per-model basis by setting `disable_constraint_checks = False` on models that require constraints.
On Delete Constraints
In general, Django's emulation of SQL ON DELETE constraints works with Djangae on the Datastore. Due to eventual consistency, however, the constraints can fail. Take care when deleting related objects in quick succession: a PROTECT constraint can wrongly raise a ProtectedError when deleting an object that references a recently deleted one. Constraints can also fail to raise an error if a referencing object was created just prior to deleting the referenced one. Similarly, when using the default ON DELETE CASCADE behaviour, a newly created referencing object might not be deleted along with the referenced one.
Transactions
Django's transaction decorators have no effect on the Datastore, which means that when using the Datastore:
- `django.db.transaction.atomic` and `non_atomic` will have no effect.
- The `ATOMIC_REQUESTS` and `AUTOCOMMIT` settings in `DATABASES` will have no effect.
- Django's `get_or_create` will not have the same behaviour when dealing with collisions between threads. This is because it uses Django's transaction manager rather than Djangae's.
- If your `get` aspect filters by PK then you should wrap `get_or_create` with `djangae.db.transaction.atomic`. A collision with another thread will result in a `TransactionFailedError`.
- If your `get` aspect filters by a unique field or unique-together fields, but not by PK, then (assuming you're using Djangae's unique markers) you won't experience any data corruption, but a collision with another thread will throw an `IntegrityError`.
- If your `get` aspect does not filter on any unique or unique-together fields then you should fix that.
The following functions are available to manage transactions:
- `djangae.db.transaction.atomic` - Decorator and context manager. Starts a new transaction; accepts `xg`, `independent` and `mandatory` args.
- `djangae.db.transaction.non_atomic` - Decorator and context manager. Breaks out of any current transactions so you can run queries outside the transaction.
- `djangae.db.transaction.in_atomic_block` - Returns `True` if inside a transaction, `False` otherwise.
- `djangae.db.transaction.current_transaction` - Returns an object representing the currently active transaction.

Do not use `google.appengine.ext.db.run_in_transaction` and friends; it will break.
Writing safe Transactional code
Transactions on App Engine are a little strange and it's easy to get caught out by unexpected behaviour when using them. Take the following code for example:
```python
MyModel.objects.create(pk=1, value=0)

with atomic():
    instance = MyModel.objects.get(pk=1)
    instance.value = 1
    instance.save()

    instance.refresh_from_db()
    assert(instance.value == 1)
```
By default, this code would raise an `AssertionError` and the value of `instance.value` would be zero. This is because until the moment that the transaction commits, all Datastore reads will read the value from outside the transaction, not the value you saved from within the atomic block.
To make things behave in a more predictable way, Djangae provides two features:
- The "Context Cache"
- The `Transaction` object
The context cache is enabled by default, and is a thread-level in-memory cache of entities that have been saved within the transaction. When the context cache is enabled, the above code would not raise an `AssertionError`: `instance.value` would equal `1`, because `refresh_from_db()` would hit the local cache and not actually read from the Datastore.
There are times when it's necessary to read from the Datastore directly, and you can use `djangae.db.caching.disable_cache()` for that.
The other feature that Djangae provides is the `Transaction` object, and in particular `Transaction.has_been_read(instance)`, `Transaction.has_been_written(instance)` and `Transaction.refresh_if_unread(instance)`:
```python
instance = MyModel.objects.create(pk=1, value=0)

with atomic() as txn:
    assert(not txn.has_been_read(instance))
    txn.refresh_if_unread(instance)
    assert(txn.has_been_read(instance))
    instance.value = 1
    instance.save()
```
Transactions in the Datastore generally need to be made up of a Get and a Put. If you don't do both within the atomic block, then you could overwrite data by accident. It's important to understand that nested transactions don't exist: if you nest `atomic()` decorators (without the `independent` flag), your inner `atomic()` call will be a no-op.
For that reason, if you blindly call `refresh_from_db()` from an `atomic()` function, and that function is called from another `atomic()` function, then you can overwrite local changes:
```python
@atomic()
def method1(self):
    self.refresh_from_db()
    self.value += 1
    self.save()

@atomic()
def method2(self):
    self.refresh_from_db()
    self.value += 1
    self.method1()  # the refresh_from_db() here wipes the local += 1
    self.save()

instance.method2()
instance.refresh_from_db()
instance.value  # -> 1, not 2. Oops!
```
You can write safer code by using `refresh_if_unread(instance)`, which will only update the instance if you haven't already done so in the current transaction.
`refresh_if_unread()` also takes an optional keyword argument called `prevent_further_reads`, which adds additional protection against reading the same instance twice inside a transaction (see below).
Sometimes when working with transactions you might find you accidentally introduce an additional entity group into the transaction. This is very easily done by following a ForeignKey relationship inside the transaction, and if the related instance is shared by a large number of objects you'll rapidly see a large number of `TransactionFailedError`s thrown.
The Djangae `Transaction` object exposes a method to help protect against this situation, called `prevent_read`.
```python
with transaction.atomic() as txn:
    txn.prevent_read(MyModel, 1)

    MyModel.objects.get(pk=1)  # Raises PreventedReadError
```
Multiple Namespaces
It's possible to create separate "databases" on the Datastore via "namespaces". This is supported in Djangae through the normal Django multiple-database support. To configure multiple Datastore namespaces, you can add an optional `NAMESPACE` to the `DATABASES` setting:
```python
DATABASES = {
    'default': {
        'ENGINE': 'djangae.db.backends.appengine'
    },
    'archive': {
        'ENGINE': 'djangae.db.backends.appengine',
        'NAMESPACE': 'archive'
    }
}
```
If you do not specify a `NAMESPACE` for a connection, then the Datastore's default namespace will be used (i.e. no namespace).
You can make use of Django's routers, the `using()` method, and `save(using='...')` in the same way as normal multi-database support.
Cross-namespace foreign keys aren't supported. Also, namespaces affect cache keys and unique markers (which are likewise restricted to a namespace).
Special Indexes
The App Engine Datastore backend handles certain queries which aren't supported natively by adding hidden fields to the Datastore entities, or by storing additional child entities. The mechanism for adding these fields and then using them during querying is called "special indexing".
For example, querying for `name__iexact` is not supported by the Datastore. In this case Djangae generates an additional entity property containing the `name` value lower-cased, and then when performing an iexact query, will lower-case the lookup value and use the generated column rather than the `name` column.
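The derived-property trick can be illustrated in plain Python. This is a simulation of the idea only; the actual property names and index layout Djangae uses are implementation details (`_idx_name_lower` below is made up for the demo):

```python
# Simulation of special indexing for case-insensitive lookups: a derived,
# lower-cased copy of the field is written on save, so that an iexact
# lookup becomes a plain equality filter on the derived property.

def put(datastore, key, name):
    # On save, store the original value plus the derived index property.
    datastore[key] = {"name": name, "_idx_name_lower": name.lower()}

def query_iexact(datastore, value):
    # The lookup value is lower-cased and matched with normal equality.
    needle = value.lower()
    return [k for k, e in datastore.items() if e["_idx_name_lower"] == needle]

store = {}
put(store, 1, "Alice")
put(store, 2, "ALICE")
put(store, 3, "Bob")
print(query_iexact(store, "alice"))  # [1, 2]
```

This is also why entities saved before the index existed don't match such queries: they have no derived property to filter on until they are resaved, which is exactly the resave requirement described below.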
When you run a query that requires special indexes for the first time, an entry will be added to a generated file called `djangaeidx.yaml`. You will see this file appear in your project root. From that point on, any entities that are saved will have the additional property added. If a new entry appears in `djangaeidx.yaml`, you will need to resave all of your entities of that kind so that they will be returned by query lookups.
contains and icontains Filters
When you use `__contains` or `__icontains` in a query, the `djangaeidx.yaml` file will be updated so that all subsequent entity saves generate an additional descendent entity per instance field, to store indexing data for that field. This approach adds an additional `Put()` for each `save()` and an additional `Query()` for each `__contains` lookup.
Previously, Djangae stored this index data on the entity itself, which caused a number of problems that are now avoided:
- The index data had more permutations than were necessary. This was because each set of possible characters had to be stored in a list property so that the lookup could use an equality query. Djangae couldn't rely on an inequality (which would allow storing fewer permutations) because that would greatly restrict the queries that a user could perform.
- The large number of permutations caused the entities to bloat with additional properties, and it wasn't possible to filter them out when querying, which meant every query (whether using `contains` or not) would transfer a large amount of data over the RPC, slowing down every single query on an instance which had contains data indexed.
- The implementation was flawed. It was originally thought that list properties were limited to 500 entries; this may have been true at some point in Datastore history, but it's certainly not true now. Because of this incorrect assumption, indexed data was split across properties, which made the code very confusing.
For now, the legacy behaviour is available by setting `DJANGAE_USE_LEGACY_CONTAINS_LOGIC = True` in your settings file. This setting will be removed, so it's recommended that upon upgrading to Djangae 0.9.10 you resave all of your entities (that use `contains`) instead.
Resaving will not remove old indexed properties; we hope to provide a migration file in future that will do that for you.
Querying date and datetime Fields with contains
Just as when Django is used with a SQL database, Djangae's indexing allows `contains` and `icontains` filters to be used on `DateField` and `DateTimeField`. When this is used, the field values are converted to ISO format and then indexed as strings, allowing you to perform `contains` queries on any part of the ISO format string.
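The ISO-string indexing can be sketched in plain Python. This is an illustration of the idea only, not Djangae's actual code:

```python
from datetime import datetime

# A datetime is indexed as its ISO-format string, so a contains filter
# reduces to a substring test against that string.

def index_value(dt):
    return dt.isoformat()

def matches_contains(dt, fragment):
    return fragment in index_value(dt)

dt = datetime(2016, 3, 14, 9, 26)
print(index_value(dt))                  # 2016-03-14T09:26:00
print(matches_contains(dt, "2016-03"))  # True
print(matches_contains(dt, "09:26"))    # True
```

Because the match is against the full ISO string, you can filter on a year, a month prefix, or a time-of-day fragment alike.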
Distributing djangaeidx.yaml
If you are writing a portable app, and your app makes queries which require special indexes, you can ship a custom `djangaeidx.yaml` in the root of your Django app. The indexes in this file will be combined with the user's main project `djangaeidx.yaml` at runtime.
Migrations
Djangae has support for migrating data using the Django migrations infrastructure. See Migrations.