22. August 2007

VPTN 2 und 3

Endlich am Wochenende auf dem Barcamp Köln die Spezifikation für "Virtual Presence User Data" fertiggestellt. Zusammen mit "Vitual Presence Location Mapping" die Grundlage für ein globales, verteiltes und chatsystemübergreifendes VP-System. 

OpenID und FOAF sind sehr eng verwandt mit VP-Identity und können vermutlich sinnvoll integriert werden. OpenID hat ein Konzept für selektive Veröffentlichung. Das fehlt bisher bei VP-Identity und FOAF. FOAF gibt es schon lange. Es ist ziemlich weit entwickelt, hat aber nie den Mainstream erreicht. OpenID ist sehr jung, aber auf dem Weg zum Mainstream. OpenID Services könnten als VP-Identity Item integriert werden, FOAF ebenfalls. Vor allem FOAF hätte auch das Potential als VP-Identity Format. Dafür müsste man FOAF erweitern um Features, wie VP-Identity Digest. VP-Identity ist eine gute Grundlage. Mal sehen was noch kommt. 


17. August 2007

Lessons for Big Systems


Take load from the DB

  • Finally the DB is the bottleneck.
  • There is only one DB (cluster), but there can be hundreds of CPUs (web server) and caches (memcache server).
  • Let the CPUs work. 10 web server CPU cycles are better, than 1 DB CPU cycle.
  • Aim at 0,1 DB operations per web page by average.
  • Make it F5-safe. No DB operations for page reloads. No DB for views.
Avoid SQL
  • Keep all live data in memory.
  • Store only for persistency, not for report generation.
  • Use a quick storage, storing 50.000 items per sec is possible
  • DB != SQL, there are quicker interfaces
  • The index is always in memory. That's what SQL DBs are good for.
  • But there are other indexes as well.
External IDs
  • Do not use DB IDs externally. Map all IDs.
  • Use memcache to map external IDs to internal (often DB) IDs.
  • Use memcache as a huge hashtable.
  • External IDs may be strings. After the mapping continue with numbers internally.
DB search loves numbers
  • Everything you search for must be indexed.
  • Avoid indexes on TEXT, VARCHAR. INSERT with index takes significantly longer for text.
  • You may store text in the DB, but do not search for it.
  • You may spend some CPU to map text IDs to numbers for the DB.
100,000 concurrent
  • Imagine 1% of your users are doing the same thing in an instant.
  • If it affects online users, then each task is x 100,000.
  • If it affects all users then everything is x 1-10 Mio.
  • Anything must be at at least 1000/per sec.
  • Do maintenance all the time. There will never be a time of the day where load is so small, that you can cleanup something. Cleanup permanently.
Memcache every business object
  • No object is constructed from the DB.
  • Everything is buffered by the cache.
  • Code with real interfaces, which can be cache-enabled later.
Code for the speed
  • Code for the cache. It is there. It is essential. No way to pretend it is not just for the "beauty" of the code.
  • Write beautiful cache-aware code.
Memcache frontend data
  • Parsing template costs much CPU.
  • Cache generated HTML fragments.
Do not overload the cache
  • Not more than 10 memcache requests per script.
  • If you expect many items, say a mailbos with many messages, then put a summary into a list (mailbox) object even though the same information is in the individual messages.
No statistics on the live system
  • Occasionally they want statistics. Don't do it live.
  • Take snapshots, take the backup. Process it somewhere else.
  • Make statistics offline.
Simple SELECTs
  • Use only simple SELECTs on indexed columns
  • Forbidden keywords: JOIN, ORDER BY
  • Structure and code must guarantee small DB results.
  • Sort in the code not in the DB.
  • If you really need aggregated data, then aggregate permanently. Do not aggregate on demand.
Basics and Trivialities:

Distribute everything
  • Do not rely on a single server for a task.
Check all input

  • Check ALL input.
  • Not only query params are input.
  • Cookies, HTTP header fields are also input.
SQL injection
  • SQL-escape all data in SQL strings.
  • Use prepared statements and variables.
  • Use a real programming language.
  • Use a compiled language, because the compiler eliminates errors.
  • You will have errors which will wake you at night. So, reduce errors by any means, even if you like script languages.
  • Simple deployment of script languages won't work anyway in the long run, because you will switch on caching and you will have to invalidate the script cache for deployment.

7. August 2007

Yet Another Tag Cloud für Blogger

Ich habe keine Tag-Cloud in der Widget Liste von Blogger gefunden. Nur ein Label-Widget, dass alle Tags als Liste darstellt. Deshalb hier eine kurze Beschriebung, wie man die Label-Liste zur Tag-Cloud umbaut.

1. Label-Widget (Seitenelement) einfügen.

2. HTML/Javascript-Widget einfügen.

3. Im HTML/Javascript-Seitenelement:

<script src="http://www.wolfspelz.de/blogger.js"></script>
<script>BloggerTagCloud('Label1', 1.0);</script>

div.Label div.widget-content ul li { display:inline; word-spacing:-0.3em; padding:0 8px 0 0; line-height:90%; text-indent:0px; }

Das 'Label1' in BloggerTagCloud('Label1') ist die DOM-ID des Label-Widgets, normalerweise Label1 für das erste Label-Widget.

Die 1.0 ist eine Skalierungsfaktor, der die Dynamik der Label-Größen bestimmt.


5. August 2007

Simple Remote Procedure Call

In my projects we often use remote procedure calls. We use various kinds, SOAP, XMLRPC, REST, JSON, conveyed by different protocols (HTTP, XMPP, even SMTP). We use whatever is appropriate in the situation, be it client-server, server-service, client-p2p, and depending on the code environment C++, C#, JScript, PHP.
With SOAP and XMLRPC you don't want to generate or parse SOAP-XML by hand. That's an avoidable error source. Rather you use a library, which does the RPC-encoding/decoding job. To do that you have to get used to the lib's API, modes of operations, and its quirks.
This is significant work until you are really in "complete advanced control" of the functionality. Especially, if there is only a method name with paramaters to exchange. Even more bothersome is the fact, that most such libraries need megabytes, have their own XML parser, their own network components. Stuff, we already have in our software for other purposes.
What we really need is a simple way to execute remote procedure calls

  • with an encoding so easy and fail safe, that it needs no library to en/decode,
  • that is so obvious, that we do not need an industry standard like SOAP, just to tell other
    developers what the RPC means.
The solution is a list of key-value pairs. This is Simple RPC (SRPC):
  • request and response are lists of key-value pairs,
  • each parameter is key=value
  • parameters separated by line feed
  • request as HTTP-POST body or HTTP-GET with query
  • response as HTTP response body
  • Content-type text/plain
  • all UTF-8
  • values must be single line (must not contain line feeds)
  • request method as Method=
Example (I love stock quote examples):
HTTP-POST request body:
  1. C: POST /srpc.php HTTP/1.1
  2. C: Content-type: text/plain; charset=UTF-8
  3. C: Content-length: 43
  4. C:
  5. C: Method=GetQuote
  6. C: Symbol=GOOG
  7. C: Date=1969-07-21
HTTP response body:
  1. S: HTTP/1.1 200 OK
  2. S: Content-type: text/plain; charset=UTF-8
  3. S:
  4. S: Status=1
  5. S: Average=123
  6. S: Low=121
  7. S: High=125
Additional options:
1. Multiline Values:
Of course, there are sometimes line feeds in RPC arguments and results. Line feeds must be encoded using HTTP-URL encoding (%0A) or a better readable "cstring" encoding (\n). The encoding is specified as meta parameter:
  1. News=Google%20Introduces%20New...%0AAnalyst%20says...
  2. News/Encoding=URL
  1. News=Google Introduces New...\nAnalyst says...
  2. News/Encoding=cstring
The "cstring" encoding replaces carriage-return (\n), line-feed (\r), and back-slash (\\). The "cstring" encoding indication, e.g. "News/Encoding=cstring" may be omitted.
2. Binary Values:
Binary values in requests and responses are base64 encoded. An optional "Type" uses MIME types to indicate the data type in case of e.g. image data.
  1. Chart=R0lGODlhkAH6AIAAAOfo7by/wCH5BA... (base64 encoded GIF)
  2. Chart/Encoding=base64
  3. Chart/Type=image/gif
3. The Query Variant:
Even complex result values, such as XML data, must be single line. Following the scheme above, this can be done by using "base64" or "cstring" encoding. Both are not easily readable in case of XML. SRPC offers a simpler way to return a single result value: if the request is HTTP-GET with query then the result value comes as response body with Content-type. It's a normal HTTP request, but SRPC conform.
HTTP query:
  1. C: GET /srpc.php?Method=GetPrices&Symbol=GOOG&Date=1969-07-21 HTTP/1.1
HTTP response (with a sample xml as a single result value):
  1. S: HTTP/1.1 200 OK
  2. S: Content-type: text/xml
  3. S:
  4. S: <?xml version="1.0"?>
  5. S: <prices>
  6. S: <price time="09:00">121.10</price>
  7. S: <price time="09:05">121.20</price>
  8. S: </prices>
4. Special Keys
There are 3 special keys defined:
  • request "Method=FunctionName" (RPC method)
  • response "Status=1" (1=OK, 0=error)
  • response "Message=An explanation" (an accompanying explanation for Status=0 or 1)
This Simple RPC specifies exactly how RPC requests are encoded. It's just lists of key=value pairs. But still powerful enough for all RPCs we need.