Azure Cosmos DB
Request Units, partition keys, and consistency levels: why your partition key is the most important architectural decision you'll make.
Azure Cosmos DB is a globally distributed, multi-model NoSQL database with guaranteed single-digit millisecond latency at any scale. Its core abstraction is a container partitioned by a key you choose. Cosmos distributes data and throughput across physical partitions automatically, but only if your partition key has sufficient cardinality. Get the partition key wrong and no amount of provisioned RU/s will save you from throttling.
Every item in Cosmos DB must include the partition key field. Cosmos groups items with the same partition key value into a logical partition. Logical partitions are mapped to physical partitions (each backed by a replica set of 4 nodes). A physical partition holds up to 50GB and serves up to 10,000 RU/s.
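As a rough illustration of how the two physical-partition limits interact, the sketch below computes a lower bound on the number of physical partitions a container needs. The function name and sample inputs are hypothetical; the 50GB and 10,000 RU/s limits are the figures quoted above, and Cosmos may allocate more partitions than this bound suggests.

```typescript
// Limits quoted in the text above; treated here as inputs, not guarantees.
const MAX_PARTITION_STORAGE_GB = 50;
const MAX_PARTITION_RUS = 10000;

// Lower bound on physical partitions: whichever limit (storage or throughput) binds first.
function minPhysicalPartitions(storageGb: number, provisionedRus: number): number {
  const byStorage = Math.ceil(storageGb / MAX_PARTITION_STORAGE_GB);
  const byThroughput = Math.ceil(provisionedRus / MAX_PARTITION_RUS);
  return Math.max(1, byStorage, byThroughput);
}

console.log(minPhysicalPartitions(120, 25000)); // 3 (both limits need 3)
console.log(minPhysicalPartitions(600, 20000)); // 12 (storage is the binding limit)
```

A single logical partition can never be spread over several physical partitions, which is why the key's cardinality matters more than the total provisioned throughput.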
Every operation costs RUs: a 1KB point read = 1 RU, a 1KB write = ~5 RUs, a cross-partition query = 50 to 500+ RUs depending on result size and partition fan-out. You provision RU/s at the database or container level. Exceeding your provisioned RU/s returns HTTP 429 TooManyRequests.
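A back-of-the-envelope RU budget can be sketched from the per-operation figures above. The function names and the sample workload are hypothetical, and real costs vary with item size, indexing, and query shape, so treat the result as a starting estimate only.

```typescript
// Approximate per-operation costs for ~1KB items, taken from the figures above.
const POINT_READ_RU = 1;
const WRITE_RU = 5;

// Estimated steady-state RU/s for a simple read/write mix.
function estimateRuPerSecond(readsPerSec: number, writesPerSec: number): number {
  return readsPerSec * POINT_READ_RU + writesPerSec * WRITE_RU;
}

// True when the estimated demand exceeds what is provisioned (expect HTTP 429s).
function exceedsProvisioned(demandRus: number, provisionedRus: number): boolean {
  return demandRus > provisionedRus;
}

const demand = estimateRuPerSecond(500, 200);
console.log(demand); // 1500
console.log(exceedsProvisioned(demand, 1000)); // true
```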
Cosmos offers 5 consistency levels. Strong guarantees linearizable reads but raises write latency in multi-region accounts, because each write must be committed across regions before it is acknowledged. Eventual offers the lowest latency, but reads may see stale data. Session (default) provides read-your-writes for a single client session and is the best balance for most apps.
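The ordering of the levels can be made concrete with a small helper. This is a sketch under one assumption worth stating: a client or request may relax the account's default consistency to a weaker level but not strengthen it. The helper name is hypothetical, and the level identifiers are spelled without spaces here for use as string values.

```typescript
// Consistency levels ordered from strongest to weakest.
const LEVELS = [
  "Strong",
  "BoundedStaleness",
  "Session",
  "ConsistentPrefix",
  "Eventual",
] as const;
type Level = (typeof LEVELS)[number];

// Assumption: overrides may only keep or weaken the account default.
function canRelaxTo(accountDefault: Level, requested: Level): boolean {
  return LEVELS.indexOf(requested) >= LEVELS.indexOf(accountDefault);
}

console.log(canRelaxTo("Session", "Eventual")); // true
console.log(canRelaxTo("Session", "Strong")); // false
```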
By default Cosmos indexes every property. This makes any query possible but wastes RUs on writes. Exclude unused paths (large blobs, nested arrays you never query) using excludedPaths. Use composite indexes for ORDER BY with WHERE filters, because without them ORDER BY across a large container costs enormous RUs.
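An indexing policy combining both ideas might look like the following plain object. The `/rawPayload` and `/status` paths echo names used elsewhere in this article; `/createdAt` is a hypothetical property chosen for illustration.

```typescript
// Plain object in the shape of a Cosmos DB indexing policy document.
const indexingPolicy = {
  indexingMode: "consistent",
  automatic: true,
  includedPaths: [{ path: "/*" }], // index everything by default...
  excludedPaths: [{ path: "/rawPayload/*" }], // ...except the large blob subtree
  compositeIndexes: [
    [
      // Intended for a query shaped like:
      //   SELECT * FROM c WHERE c.status = @s ORDER BY c.status ASC, c.createdAt DESC
      // The ORDER BY directions should match this definition (or be its exact inverse).
      { path: "/status", order: "ascending" },
      { path: "/createdAt", order: "descending" },
    ],
  ],
};

console.log(JSON.stringify(indexingPolicy, null, 2));
```

Listing the filter property first in the composite index, and repeating it in the ORDER BY clause, is what lets the query use the index for both the filter and the sort.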
Set defaultTtl on the container and optionally ttl on each item. Cosmos automatically deletes expired items in the background using leftover RUs that user requests did not consume, so the deletes add no RU charge. Without TTL, logs, sessions, and audit records accumulate forever and inflate storage costs (Cosmos charges per GB stored).
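How the container-level and item-level settings combine can be sketched as a pure function. The function names are hypothetical. The rules modeled are: an item `ttl` has no effect unless the container has `defaultTtl` set, a value of -1 means "never expire", and expiry is measured from the item's last-modified timestamp (`_ts`).

```typescript
// Returns the TTL in seconds that applies to an item, or null if it never expires.
function effectiveTtlSeconds(
  containerDefaultTtl: number | undefined,
  itemTtl: number | undefined,
): number | null {
  if (containerDefaultTtl === undefined) return null; // TTL disabled on the container
  const ttl = itemTtl ?? containerDefaultTtl; // item-level value overrides the default
  return ttl === -1 ? null : ttl;
}

// An item is eligible for deletion once its TTL has elapsed since last modification.
function isExpired(lastModifiedTs: number, ttl: number | null, nowTs: number): boolean {
  return ttl !== null && nowTs >= lastModifiedTs + ttl;
}

console.log(effectiveTtlSeconds(2592000, undefined)); // 2592000 (container default)
console.log(effectiveTtlSeconds(-1, 3600)); // 3600 (only this item expires)
console.log(effectiveTtlSeconds(undefined, 3600)); // null (item ttl ignored)
```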
Cosmos Change Feed is an append-only log of changes per container, ordered by modification time within each partition. It powers cache invalidation, event-driven projections, and real-time analytics. It does NOT capture deletes (unless using TTL delete pattern or Change Feed with full fidelity preview).
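One common reading of the "TTL delete pattern" is a soft delete: instead of deleting an item, update it with a marker flag and a short `ttl`. The update appears in the Change Feed, and Cosmos removes the item afterwards. The sketch below shows only the item transformation; the `deleted` flag and the function name are hypothetical, and the container must have TTL enabled for the item-level `ttl` to take effect.

```typescript
// Returns a copy of the item flagged as deleted and scheduled for TTL removal.
function markDeleted<T extends { id: string }>(
  item: T,
  ttlSeconds: number,
): T & { deleted: true; ttl: number } {
  return { ...item, deleted: true, ttl: ttlSeconds };
}

const tombstone = markDeleted({ id: "ord-1234", userId: "user-42" }, 60);
console.log(tombstone);
// Upserting `tombstone` would surface the change to Change Feed consumers,
// which can treat `deleted: true` as a delete event.
```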
Key Concepts
Request Units (RUs): Normalized throughput currency. Combines CPU, memory, and IOPS. A 1KB point read = 1 RU. Provisioned or serverless. Throttled at 429 when exceeded.
Partition key: The field used to shard data. Must be high-cardinality (millions of values), immutable after item creation, and appear in your most frequent queries. Wrong choice = hot partition = 429s.
Partitions: A logical partition is all items sharing a partition key value (max 20GB). A physical partition is a group of logical partitions on one set of replicas (max 50GB, 10K RU/s). Cosmos splits physical partitions automatically.
Consistency levels: Strong, Bounded Staleness, Session (default), Consistent Prefix, Eventual. Each trades read/write latency for data freshness guarantees across regions.
Change Feed: Append-only ordered stream of all inserts/updates per container. Powers event-driven architectures. Does not capture deletes by default.
Multi-region writes: Enable writes to any configured region. Cosmos uses last-writer-wins (LWW) or custom conflict resolution. Latency drops to <10ms for users near any region.
Serverless: pay per RU consumed, no baseline cost; good for dev/test or spiky workloads. Provisioned: fixed RU/s allocated, predictable cost and performance; good for production.
Indexing policy: Controls which paths are indexed. Default: all paths. Exclude large blobs to save write RUs. Add composite indexes for multi-field ORDER BY queries.
    // Azure Cosmos DB: @azure/cosmos SDK v4
    // Partition key design: high-cardinality /userId

    import { ChangeFeedStartFrom, CosmosClient, PartitionKeyKind } from "@azure/cosmos";

    // Placeholder type and handler so the snippet is self-contained; replace with your own.
    interface Order { id: string; userId: string; status: string }
    async function processChanges(changes: unknown[]): Promise<void> {
      console.log(`received ${changes.length} changed items`);
    }

    const client = new CosmosClient({
      endpoint: process.env.COSMOS_ENDPOINT!,
      key: process.env.COSMOS_KEY!,
    });

    const { database } = await client.databases.createIfNotExists({ id: "myapp" });
    const { container } = await database.containers.createIfNotExists({
      id: "orders",
      partitionKey: {
        paths: ["/userId"], // HIGH cardinality: millions of distinct values
        kind: PartitionKeyKind.Hash,
      },
      defaultTtl: 7776000, // 90-day TTL in seconds; prevents unbounded storage growth
      indexingPolicy: {
        automatic: true,
        indexingMode: "consistent",
        includedPaths: [{ path: "/*" }], // root path must be listed as included or excluded
        excludedPaths: [{ path: "/rawPayload/*" }], // exclude large blobs from index; saves write RUs
      },
    });

    // Point read: ~1 RU for a 1KB item. Fastest possible operation.
    // Requires BOTH the item id AND the partition key value.
    const { resource: order } = await container
      .item("ord-1234", "user-42") // (id, partitionKeyValue)
      .read<Order>();

    // Cross-partition query: expensive! Use sparingly.
    // No partition key is supplied, so the SDK fans out to every partition;
    // cost scales with partition count.
    const { resources: recentOrders } = await container.items
      .query({
        query: "SELECT * FROM c WHERE c.status = @status AND c._ts > @cutoff",
        parameters: [
          { name: "@status", value: "pending" },
          { name: "@cutoff", value: Math.floor(Date.now() / 1000) - 3600 },
        ],
      })
      .fetchAll();

    // Change Feed (pull model): react to inserts/updates
    const changeFeedIterator = container.items.getChangeFeedIterator({
      changeFeedStartFrom: ChangeFeedStartFrom.Beginning(),
    });
    for await (const { result, statusCode } of changeFeedIterator.getAsyncIterator()) {
      // 304 Not Modified means we have caught up. A long-running worker
      // would wait and poll again instead of stopping here.
      if (statusCode === 304) break;
      await processChanges(result); // drives downstream projections, cache invalidation
    }
Cosmos DB can serve millions of requests per second across the globe with <10ms latency, but only if data is distributed correctly. A bad partition key concentrates all load on one physical partition (a 'hot partition'), which caps at 10,000 RU/s and 50GB regardless of how much throughput you've provisioned. The partition key decision is made at container creation and cannot be changed.
Common Pitfalls
1. IoT telemetry platform: hot partition from /deviceType key
A smart building startup stores sensor readings from 50,000 devices in Cosmos DB, partitioned by /deviceType. There are 6 device types (temperature, humidity, CO2, motion, door, light). The platform worked fine in staging with 100 simulated devices but throttled constantly at 80k+ readings/minute in production.
With only 6 partition key values, nearly all writes concentrated in 2 to 3 physical partitions (most devices were temperature sensors). Those partitions hit the 10,000 RU/s physical partition cap and returned HTTP 429 continuously. Provisioning more RU/s at the container level didn't help because the cap is per physical partition.
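The effect can be modeled with a few lines of arithmetic. This is a simplified sketch, assuming provisioned throughput is split evenly across physical partitions and capped at the 10,000 RU/s per-partition figure above. The function name and sample numbers are hypothetical, not measurements from this incident.

```typescript
const PHYSICAL_PARTITION_CAP_RUS = 10000; // per-partition ceiling quoted in the text

// Returns true when the busiest partition's demand exceeds its share of throughput.
function hotPartitionThrottles(
  provisionedRus: number,
  physicalPartitions: number,
  totalDemandRus: number,
  hotShare: number, // fraction of all traffic landing on the busiest partition (0..1)
): boolean {
  const perPartitionBudget = Math.min(
    provisionedRus / physicalPartitions,
    PHYSICAL_PARTITION_CAP_RUS,
  );
  return totalDemandRus * hotShare > perPartitionBudget;
}

// 60,000 RU/s provisioned over 6 partitions, 30,000 RU/s of total demand:
console.log(hotPartitionThrottles(60000, 6, 30000, 0.5)); // true: 15,000 > 10,000
console.log(hotPartitionThrottles(60000, 6, 30000, 1 / 6)); // false: about 5,000 per partition
```

In the skewed case the container is throttled even though total demand is only half of what is provisioned.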
Re-partitioned using a synthetic key: /partitionKey = deviceId + '_' + Math.floor(timestamp / 3600), where timestamp is in epoch seconds, so the suffix is an hour bucket. This created ~50,000 × 24 = 1.2M distinct partition key values per day, distributing load evenly across hundreds of physical partitions. RU cost per write stayed the same; throttling dropped to zero.
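The synthetic key described above can be sketched as a small helper. The function name and the sample device id are hypothetical; the timestamp is assumed to be in epoch seconds, which is what dividing by 3600 implies.

```typescript
// Builds "<deviceId>_<hourBucket>", where hourBucket is whole hours since the Unix epoch.
function buildPartitionKey(deviceId: string, epochSeconds: number): string {
  const hourBucket = Math.floor(epochSeconds / 3600);
  return `${deviceId}_${hourBucket}`;
}

// 1705312800 is 2024-01-15T10:00:00Z.
console.log(buildPartitionKey("dev-001", 1705312800)); // "dev-001_473698"
console.log(buildPartitionKey("dev-001", 1705312800 + 3599)); // same hour, same key
console.log(buildPartitionKey("dev-001", 1705312800 + 3600)); // "dev-001_473699"
```

The trade-off is on the read side: fetching one device's history now means computing the hour buckets for the time range and querying each resulting key.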
Takeaway: Partition key cardinality must scale with your write throughput. Rule of thumb: you need at least as many logical partitions as you have peak concurrent writers. A /deviceType or /status key is almost always a hot partition waiting to happen in production.
2. E-commerce order history: cross-partition query cost explosion
A marketplace added an admin dashboard that queries 'all orders in the last 24 hours with status=processing'. In staging it cost ~15 RU/query. In production with 2M items across 400 partitions, the same query started returning RequestChargeTooLarge errors and costing 8,000+ RUs per execution.
The query 'SELECT * FROM c WHERE c.status = @s AND c._ts > @t' has no partition key in the WHERE clause. Cosmos performs a fan-out query: it sends the query to every physical partition and merges results. With 400 partitions, the base fan-out cost multiplied by 400. The development environment had 5 partitions.
Introduced a materialized view pattern: a Change Feed processor writes order summaries to a separate 'order-status-projections' container partitioned by /statusDay (e.g., 'processing_2024-01-15'). Admin queries hit this container with the partition key, costing ~5 RUs regardless of data volume.
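A minimal sketch of how the projection's partition key could be derived, following the 'processing_2024-01-15' format from the example. The function names are hypothetical, and the day boundary is assumed to be UTC.

```typescript
// Builds "<status>_<YYYY-MM-DD>" using the UTC calendar day of the timestamp.
function statusDayKey(status: string, epochSeconds: number): string {
  const day = new Date(epochSeconds * 1000).toISOString().slice(0, 10);
  return `${status}_${day}`;
}

// A "last 24 hours" window spans two UTC days, so the dashboard needs two keys.
function keysForLast24Hours(status: string, nowSeconds: number): string[] {
  return [statusDayKey(status, nowSeconds - 86400), statusDayKey(status, nowSeconds)];
}

// 1705312800 is 2024-01-15T10:00:00Z.
console.log(statusDayKey("processing", 1705312800)); // "processing_2024-01-15"
console.log(keysForLast24Hours("processing", 1705312800));
// ["processing_2024-01-14", "processing_2024-01-15"]
```

Each key maps to a single logical partition, so the dashboard issues at most two single-partition queries instead of one fan-out.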
Takeaway: The cost of a cross-partition query is not constant. It scales with physical partition count, which grows as your data grows. Queries that cost 15 RUs in dev can cost 15,000 in production. Always identify cross-partition query patterns early and build partitioned projections for them.
3. User session store: unbounded storage from missing TTL
A SaaS CRM used Cosmos DB to store user sessions (JWT data, UI preferences, last-seen state). After 18 months, the storage bill increased from $200/month to $4,800/month for the same active user count. The session container had grown to 12TB.
Sessions were written on every login but never explicitly deleted on logout (logout was client-side only). Inactive users accumulated sessions forever. No TTL was configured. Cosmos charges $0.25/GB/month, so 12TB = $3,072/month just for storage, on top of RU costs. The data had zero business value after 30 days.
Added defaultTtl: 2592000 (30 days) to the container. Cosmos silently deleted all items older than 30 days over the following 48 hours, with no additional RU charge for the cleanup. Storage dropped from 12TB to ~40GB. Added a monitor on 'Total Data Size' with a $200 budget alert.
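The storage arithmetic in this case is easy to check. The sketch below uses the $0.25/GB/month rate quoted above and assumes 1TB = 1024GB; the function name is hypothetical, and actual pricing depends on region and account configuration.

```typescript
const PRICE_PER_GB_MONTH_USD = 0.25; // rate quoted in the text

// Monthly storage cost in USD for a given data size in GB.
function monthlyStorageCostUsd(storageGb: number): number {
  return storageGb * PRICE_PER_GB_MONTH_USD;
}

console.log(monthlyStorageCostUsd(12 * 1024)); // 3072 (12TB before TTL)
console.log(monthlyStorageCostUsd(40)); // 10 (about 40GB after TTL cleanup)
```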
Takeaway: Any container holding time-bounded data (sessions, logs, temp state, rate limit counters) MUST have TTL configured from day one. Storage costs compound silently. Cosmos TTL cleanup is free: you only pay for the storage while items are alive.