Fraun Travels

Wednesday, March 09, 2011

Claw Hammer 2: Fastest in the East

My next tool uses Python's built-in syntax parser to do most of the pry-work, but my script retargets the parsed Python program as Java source code instead. Interestingly, the syntax of the one language expresses the other nicely. It needs a better project name, though.
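
To illustrate the general idea, here is a minimal sketch (not the cottonhead code itself) that uses the standard `ast` module to turn a single-return Python function into a Java method. The names `expr_to_java` and `function_to_java` are made up for this example, every parameter is assumed to be an `int`, and it assumes Python 3.8 or later rather than the 2.6 the tool targets.

```python
import ast

# Hypothetical mapping from Python operator nodes to Java operator tokens.
JAVA_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def expr_to_java(node):
    """Render a small subset of Python expressions as Java source text."""
    if isinstance(node, ast.BinOp):
        op = JAVA_OPS[type(node.op)]
        return "(%s %s %s)" % (expr_to_java(node.left), op, expr_to_java(node.right))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise NotImplementedError(type(node).__name__)


def function_to_java(source):
    """Translate one single-return Python function into a static Java method.

    Assumption for this sketch: all parameters and the result are ints.
    """
    func = ast.parse(source).body[0]
    params = ", ".join("int " + arg.arg for arg in func.args.args)
    ret = func.body[0]
    assert isinstance(ret, ast.Return)
    return "static int %s(%s) {\n    return %s;\n}" % (
        func.name, params, expr_to_java(ret.value))


java_source = function_to_java("def area(w, h):\n    return w * h + 1\n")
print(java_source)
```

The parser does the hard part (precedence, nesting), so the emitter only has to walk the tree and print each node in the other language's spelling.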


The purpose of 'cottonhead' is to let people comfortable with Python generate Java source files without having to actually write Java. I find that Python goes together more easily, relying on home-row-focused sequences, indentation-implied scope, and fewer semicolons. Somewhere I read that you can write Python programs 3-5 times faster than Java ones. So stick this in your tool chain and snake it!


Sort of a spring break side-project. Relies on Python 2.6 (because of decorators and the with statement). Also, it's unfinished.


Saturday, January 29, 2011

The Right Online Policy (Issues Response)

I think "common sense" is a pretty safe place for Google to put its policy, and this is backed up by Lessig's suggestion that cyberspace has the ability to regulate itself. But while technological capability is pretty much eventual, Lessig is saying that the "real space" values needed by cyberspace are social and civil.

Will Google's democratic kind of enforcement succeed? Communities are generally moved to act, so as long as Google continues to enforce community opinion I think it absolutely will. And they do this despite it costing them $2 million a day (1). Increasingly, the Internet is seen as a kind of public utility, a concept supported by folks such as Senator Lieberman and President Obama (2) being in favor of net neutrality.

Of course, small amounts of inappropriate content get through the censors (Single Ladies -- not exactly 2girls1cup -- http://www.youtube.com/watch?v=ir8BO4-7DkM), but once again, innovation could be our salvation, as seen with Facebook's new Social Login (3) that strengthens the identity of users. This demonstrates the right step towards internet lawfulness because the opinion of a real individual should outweigh the opinion of an artificial entity composed primarily of interlinked news articles.

On the subject of law and in the spirit of free speech, regulation will always be challenged. Legislation like COPPA calls for very specific rules, but those rules are really a small part of common sense. Take a look at how small it really is: http://www.coppa.org/coppa.htm. A forum like the Internet should make us capable of deliberating on vast topics and making huge decisions. Is it time for man to heed machine's rendering of multicultural law?

Tackling this issue of internet violence has been quite timely given the current protests and their role on the internet, although just researching the topics left me feeling sad and unclean (http://www.youtube.com/watch?v=gw6HpQaZz-c). I found the only way for me to discuss it was in a context of decency.

(3)

In response partially to:


Friday, October 15, 2010

How to Forge Your Own Claw Hammer

October 15th 2010

This library is Yet Another HTML Generator, written in Python. While other offerings present a general interface, my solution is more specific, uniting Django Templates and markup under a simple, functional API. It's less focused on object side-effects, and more on generating results.

DocTools/html/

The main aim of the API is to be able to generate properly-formatted HTML templates by making nested function calls with recognizable names. To the programmer, document structure is more a matter of matching parentheses, which is more natural in a Python IDE. This library is best used when generating markup dynamically from a coding environment. It seeks to hide the details of Django and markup so the programmer only thinks functionally.
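
As a rough illustration of the nested-call style (a sketch only, not the library's actual API), each element can be a plain function that returns a string. The `element` factory and the names `html`, `body`, `h1` and `p` are assumptions made up for this example.

```python
from html import escape


def element(tag):
    """Build a function that renders `tag` around its already-rendered children."""
    def render(*children, **attrs):
        # A trailing underscore lets callers pass reserved words such as class_.
        attr_text = "".join(
            ' %s="%s"' % (name.rstrip("_"), escape(str(value)))
            for name, value in sorted(attrs.items()))
        return "<%s%s>%s</%s>" % (tag, attr_text, "".join(children), tag)
    render.__name__ = tag
    return render


# Hypothetical element names; the real library's names may differ.
html, body, h1, p = (element(t) for t in ("html", "body", "h1", "p"))

page = html(
    body(
        h1("Hello"),
        p("{{ message }}", class_="note"),
    )
)
print(page)
```

Because every call just returns text, there are no objects to mutate, and a Django template variable such as `{{ message }}` passes through untouched for later rendering.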

However, I wanted to be able to generate not just document fragments, but complete pages from the WSGI, with some top-down completeness in the expression.

(Inline sample?? pastebin??)

To make the generation more dynamic, a secondary feature of this is a template repository and loader for Django Templates, so that inheritable template structures can be created on-the-fly (without a backing file system or database store).

(Template repository reference)

Additionally, following reuse principles, each element's construction can be deferred by substituting its instantiation with a Model. This way, the object structure can be abstracted underneath the generation level.

(Each Element has a .Model attribute)

October 14th 2010

Perhaps the originating desire for a dynamic template and html content system was for speeding up development of StuphMUD web content. I spent more time on the embedded django service today, implementing a mud.runtime.Facility and reload/debugging capabilities.

(pastebin embed of embedded/service.py)

Naturally, rich site archetypes can be rooted in the file-system, but rapidly added functionality (behind special procedures and GIRL) demanded a more powerful presentation at the fingertips of the programmer.

(link to new special procedures management in mud.zones and to mud.runtime and mud.lang.girl)


Friday, September 10, 2010

Hurrah for Browsers

While developing a website for my peer group as a school assignment, I decided to demonstrate a knowledge and use of web technologies in a novel way: using a Java applet and the corresponding Rhino scripting engine present in modern Java installations for initializing a Swing user interface.

My implementation relies on common browser functionality called LiveConnect that lets web page developers use JavaScript (ECMAScript) to access the Java objects within the applet. Given this gateway, my goal was to initialize the Swing platform with javascript code that is either embedded in the web page (as a <script> tag, but not executed by the standard SquirrelFish, V8 or MS JS interpreter), or from a separate, external file specified in HTML tags next to the applet and downloaded through the browser compartment.

Now in this case there are two flavors of javascript: one that executes in the browser, and another that executes in the Java runtime (Rhino). That the two are almost identical in syntax and grammar is almost irrelevant, because their runtime environments are different. Hence the need for a different loading mechanism for the Rhino code, one that could be stored in a separate resource, so that developers can target the Swing UI independently.

The first attempt, inlining code within the HTML document as <script> tags, proved undesirable because the browsers would just try to execute the code as their native javascript. I would need another tag unused by the HTML specification, or some kind of XML namespace magic. Needless to say, none of this was very elegant or cross-platform.

I eventually decided to support loading the UI-initialization over the wire, using AJAX (XMLHttpRequest), of course. Armed with jQuery, I quickly adapted the applet-management code to utilize this browser functionality, and my mission was complete.

Well, sort of. It turns out that IE 8 and Firefox allow XMLHttpRequest to local resources, but Chrome insists that it's violating the Cross-Domain Policy. I'm not sure why, since I haven't been able to find Chrome's source code (although aren't parts of it open?). The only thing I could find is that there's a bug in Safari that doesn't allow XHR from locally-sourced files (essentially), so Chrome “has no test case” for dealing with this, even though the error isn't in WebKit. This doesn't make sense to me, except that local development is often done with a dynamic webserver, instead of just using the file:// scheme.

So I develop with Firefox, despite growing fond of Chrome. It works on the live server, anyway. There's also another error: it seems as though IE and Chrome try to automatically evaluate the javascript resource downloaded with the XHR as a part of the browser's native javascript environment, even though my processing logic expressly avoids this. Its target is an entirely different language runtime, so the code fails and I see an error.

When I disable the single point of configuration that specifies the external code resource, this behavior goes away. The strange part is that IE reports an error on what is obviously the first line of the code, but gives it some very huge line number:

110744134
217663110
218174798
218682374

So why would XHR have this kind of behavior? Is it because of jQuery? I'd like to quickly finish this part of the project so I can actually focus on the standardized environment of the Swing application.


Tuesday, June 29, 2010

AppEngine's dev_appserver (More)

The dev_appserver pickled database loads into memory three different forms of each entity, two of which are encoded for quick writing to the datastore.

Queried entities are returned as copied instance data, with protocol-buffer operations.

The encoded forms are only used for writing, so why store them all the time in memory? Because every time an entity is put back into the store, all of these stored forms are updated, and the entire database is written back to file. By storing the encoded forms all the time, they don't have to be regenerated for each write...

The following code removes the stored encoded forms, generating them on demand as descriptor properties. (file: dev_appserver/datastore_file_stub.py)

class _StoredEntity(object):
    def __init__(self, entity):
        # self.protobuf = entity
        self.encoded_protobuf = entity.Encode()
        self.native = datastore.Entity._FromPb(entity)

    @property
    def protobuf(self):
        # Used in _Dynamic_Get and _Dynamic_GetSchema
        entity = entity_pb.EntityProto()
        entity.CopyFrom(self.native.ToPb())
        return entity

    ## @property
    ## def encoded_protobuf(self):
    ##     # Used only in __WriteDatastore
    ##     return self.protobuf.Encode()


Generating both encoded forms dynamically, as properties, reduces the database size from 1.4Gb to roughly 204Mb. However, to speed up writes, the above code keeps the encoded form stored and generates only the protocol-buffer form dynamically for each query, because a query usually operates on fewer entities and happens entirely in memory (as opposed to on disk). This results in an in-memory executable image taking up 390Mb.

This utility analyzes the pickled form of a datastore file, optionally stripping all extra dev_appserver/datastore/stub/protobuf information from the database, meaning that all data is in primitive integers, strings, lists and dictionaries. Wherever possible, strings are converted to integers. Additionally, keys take up a lot of space because they contain a lot of information, including path and ancestor elements.

frauncache/examine_devappserver_datastore.py

This stripped-down form, when output as compact JSON, takes up just over 16Mb on disk (compared to the 23Mb for the pickled datastore file). When decoded into memory, it takes up roughly 94Mb. Of course, a lot of the functionality and key qualities are left implied.
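
The reduction to primitives and the compact output can be sketched roughly as follows. This is not the code of examine_devappserver_datastore.py; the `simplify` helper and the sample entity are assumptions for illustration, and the sketch only converts a string to an integer when the conversion loses nothing.

```python
import json


def simplify(value):
    """Recursively reduce a value to primitives, turning digit strings into ints."""
    if isinstance(value, dict):
        return {str(k): simplify(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [simplify(v) for v in value]
    # Only convert when the round trip is lossless (so "007" stays a string).
    if isinstance(value, str) and value.isdecimal() and str(int(value)) == value:
        return int(value)
    return value


# Hypothetical entity; field names are illustrative only.
entity = {"kind": "Room", "vnum": "3001", "exits": ("3002", "3003"), "name": "Temple"}

# Compact separators drop the spaces json.dumps inserts by default.
compact = json.dumps(simplify(entity), separators=(",", ":"), sort_keys=True)
print(compact)
```

Integers serialize without quotes, so converting numeric strings saves two bytes per value on disk, on top of what the compact separators save.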

Since every datastore put results in writing everything, it's best to do everything in a transaction, which means mandating a root entity... Or, switch to bdbdatastore, implemented as a java server.

So does bulkloading have a transactional form? Is it possible to incorporate the ancestor information into the exported entities?

----
Another project I was working on is to manipulate the intermediate bulkloader progress database files. While uploading/downloading entities, progress is stored in a database file, with an accompanying signature that identifies the particular bulkload operation so that an interrupted operation can be resumed with another invocation of the bulkload program.

Renaming these intermediate files breaks the signature, because the progress database filename is stored as part of the signature, which is encoded into a database table row. This script lets you manipulate these progress files, in batch, in order to rename them.

frauncache/rename_bulkload_workfiles.py

It also lets you rename them with dynamically-interpolated values, so that the filename or path parts acquire the name of the model (kind) or particular progress database file type ('progress' or 'result')

Here's an example invocation:

$ python rename_bulkload_workfiles.py \
    --new-db-name-template='.bulkload/progress/'`date +%s`'/$KIND/$DB_FILE_TYPE.sql3' \
    --rename-db-file bulkloader-progress-*
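
The `$KIND` and `$DB_FILE_TYPE` placeholders in the template above could be filled in along these lines. This is a sketch using the standard library's `string.Template`; whether the script itself uses that class is an assumption, and the `new_db_name` helper, the timestamp and the kind name are made up for the example.

```python
from string import Template


def new_db_name(template, kind, db_file_type):
    """Fill in the $KIND and $DB_FILE_TYPE placeholders of a filename template."""
    return Template(template).substitute(KIND=kind, DB_FILE_TYPE=db_file_type)


# 1277800000 stands in for the output of `date +%s`; "Zone" is a hypothetical kind.
name = new_db_name(".bulkload/progress/1277800000/$KIND/$DB_FILE_TYPE.sql3",
                   kind="Zone", db_file_type="progress")
print(name)
```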


Sunday, June 27, 2010

AppEngine's dev_appserver

I figured I'd write a little more about the Google AppEngine Datastore and my experiences with it as a developer. With infrequent internet access, the majority of my coding time is spent working on things that operate on my development machine, and not the actual AppEngine server farm. So, specifically, this is a rant about the dev_appserver tool.

Still in focus is the StuphMUD world, which on disk takes up a little less than 20Mb of text. Compressed into a zip archive, it is a little more than 4 megabytes. Now, the dev_appserver tool currently has two backing stores, one utilizing Sqlite3 and the other a totally in-memory python runtime brew. The Sqlite3 backend isn't actually a complete replacement for the entity datastore; rather, it serves as a wrapper for storing and indexing entities wrapped in Protocol Buffers (the internal serialized representation of Google data -- code.google.com/p/protobuf).

The tradeoff between the two backends is that, while sqlite keeps everything on disk and the memory footprint low (usually less than 30Mb), big operations that involve most of the datastore are slow and increase disk activity. The other, python runtime, solution loads everything into memory, so it should be able to do things quite fast. Of course, it takes a long time to load a large datastore back into memory when booting dev_appserver.

My current endeavors are to simulate as much of a full world as possible, including zone reset, entity movement and violence, but the sqlite backing store is noticeably slow. So I upgraded to 2Gb of system RAM and attempted the full load again, this time with the python runtime. With 271 zones, dev_appserver runs out of memory at 1.4Gb. How could this be? What is basically a 20Mb database on disk amounts to roughly 7200% of that size in memory!?? And truthfully, it's greater than this because the process failed with 108 more zones to load:
  • 108 / 271 = ~40%
  • 20Mb * 0.6 = 12Mb
  • 1400Mb / 12Mb = 11700%
Internal representation is almost 12 THOUSAND percent bigger?? So, apparently there is some action going on inside dev_appserver, and I suspect that Protocol Buffers for the encoded entities are being retained. I haven't looked too much into it, because I'm planning on just shrinking the size of my working set, or just sticking with the sqlite backend, even though it's unbearably slow for big operations.
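
The same estimate, worked through without the intermediate rounding. This uses only the figures quoted above and shares their assumption that every zone takes an equal share of the 20Mb on disk.

```python
disk_mb = 20.0         # on-disk size of the full world
memory_mb = 1400.0     # memory in use when the load failed (1.4Gb, rounded)
zones_total = 271
zones_missing = 108    # zones not yet loaded at the point of failure

# Fraction of the world that made it into memory, assuming equal-sized zones.
loaded_fraction = (zones_total - zones_missing) / zones_total
loaded_disk_mb = disk_mb * loaded_fraction
blowup = memory_mb / loaded_disk_mb

print("loaded fraction: %.2f" % loaded_fraction)
print("on-disk size of loaded part: %.1f Mb" % loaded_disk_mb)
print("memory / disk: %.0fx (%.0f%%)" % (blowup, blowup * 100))
```

Either way the ratio lands a little above one hundred to one, in line with the rounded 11700% figure.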

My next effort will be to finish the bulkloader scripts by converting to an auto-generated form and then try to load with these instead of using my world load process. Presumably this is a lower-level operation, but I don't have faith that it is much more efficient than my load process. I should post statistics on entity-creation operations with the different backends.

Incidentally, loading all 271 zones into the sqlite backend ends up with a datastore file that is a bit over 400Mb on disk. After loading the 163 zones into the python stub, the disk file is basically 27 Megs. I succeeded in loading all 271 into AppEngine, and it takes up 25% of my 1Gb quota (including page and template content used for TurbineCMS).

Notably, Datastore Statistics > Size of all entities : 54Mb. The other 20% must be in indexes?


Saturday, April 24, 2010

So I'm trying to load the stuph library into the Google AppEngine datastore, and there are two ways I can think about doing this. One way is to use their import interface (and upload 20Mb of uncompressed data.. boring). The other way is to upload a 4Mb zip archive of the filesystem, and then use the stuphlib (formerly libdata) package to parse and integrate into the Gql datastore.

This makes more sense because:
#1 Google's server cluster should be doing the processing, not my machine.
#2 This allows users to upload custom archive formats for groups of zones (eventually).
#3 Uploading this way uses a web interface, so progress can be monitored more fancy-like.

Modelling the world data was the easy part (except that you have to cluster data within root entities -- in this case I picked zones -- to enable transactions which make things acceptably fast). I actually found the bug in the playerfile loader too.

Now doing the data processing takes up some cycles, way more than 30 seconds (which is the default deadline per request) to load the entire world. So I wrote another load script with a class called IncrementalWorldLoader that overloads stuphlib.wldlib.WorldLoader. As you may have guessed, this takes the archive, parses the index and creates an UploadJob (in the datastore). THEN, it uses the task-queue API to postpone loading in the background and associates these tasks with the UploadJob. Each task only processes what it can before regenerating itself and terminating, and can be tuned to avoid gobbling up all of the CPU quota.
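
The process-a-bit-then-reschedule pattern can be sketched without AppEngine at all. In this simplified stand-in, a `deque` plays the role of the task queue, a dictionary plays the role of the UploadJob, and the budget is counted in items rather than elapsed time so the example is deterministic. None of these names come from the real IncrementalWorldLoader.

```python
from collections import deque


def run_task(job, queue, max_items=3):
    """Load up to max_items zones, then re-queue the job if work remains."""
    for _ in range(max_items):
        if not job["pending"]:
            break
        zone = job["pending"].popleft()
        job["loaded"].append(zone)   # stand-in for parsing and storing a zone
    if job["pending"]:
        queue.append(job)            # stand-in for regenerating the task


# Hypothetical upload job covering eight zones.
job = {"pending": deque(range(8)), "loaded": []}
queue = deque([job])

runs = 0
while queue:
    run_task(queue.popleft(), queue)
    runs += 1

print(runs, job["loaded"])
```

Lowering `max_items` trades more task invocations for less work per request, which is the knob that keeps each run under the deadline and the quota.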

The downside to this is that the incremental uploader has to refer to the uploaded world archive, which is stored in a datastore Blob. And this means retrieving that archive blob every time the task wants to continue processing. The most clever way I can come up with is using memcache to store this archive, but it's not like it's actually sitting in the same memory compartment as the webapp: it has to be serialized and deserialized across IO. This has a way of eating up quota anyway...

The other limitation is that currently my GAE account has a limit of 1 megabyte per blob. So I wrote a script today that creates segments out of the current world and zips them into archives for separate processing. Technically, the web interface could be modified to create separate UploadJobs for all of this and run them concurrently.

http://code.google.com/p/frauncache/source/browse/trunk/StuphTools/splitindex.py
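
For illustration only (this is not the linked splitindex.py), a greedy split of one zip archive into several smaller ones might look like the sketch below. The `split_archive` function and the member names are assumptions. The size check ignores zip header overhead, so a real limit would need to sit somewhat below 1 megabyte, and a single member larger than the limit still ends up in a segment of its own.

```python
import io
import zipfile


def split_archive(data, max_bytes):
    """Repack a zip archive into several zips, grouping members greedily by
    compressed size so each group's total stays within max_bytes."""
    segments, current, current_size = [], [], 0
    with zipfile.ZipFile(io.BytesIO(data)) as source:
        for info in source.infolist():
            if current and current_size + info.compress_size > max_bytes:
                segments.append(current)
                current, current_size = [], 0
            current.append((info.filename, source.read(info)))
            current_size += info.compress_size
    if current:
        segments.append(current)

    archives = []
    for members in segments:
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as target:
            for filename, content in members:
                target.writestr(filename, content)
        archives.append(buffer.getvalue())
    return archives


# Build a small hypothetical world archive in memory and split it.
world_buffer = io.BytesIO()
with zipfile.ZipFile(world_buffer, "w", zipfile.ZIP_STORED) as world:
    for number in range(4):
        world.writestr("zone%d.wld" % number, "x" * 400)

parts = split_archive(world_buffer.getvalue(), max_bytes=1000)
names = [zipfile.ZipFile(io.BytesIO(part)).namelist() for part in parts]
print(names)
```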

If the initial request could skip storing the full archive and simply break the index into its separate parts (storing THOSE), then this process would integrate perfectly with individually-uploaded world files. What AppEngine needs is a way to schedule low-priority background tasks that can go on for quite a while longer.

But for this I expect they'll want me to enable billing.
