old dog, newish languages

Well, I've spent the last couple of days starting to come up to speed on python. I've looked it over before, and it is okay in general, and quite good in a number of ways.

I don't mind the indenting, and the OO piece doesn't seem to be quite as elaborate and wordy as most I've seen.

It is always an interesting choice whether to spend the time to learn a new language. Even (especially) the zillionth language. There are dozens of details that are different enough that one has to examine each statement and realize that it might not be doing anything like what I want.

How does one do .NOT.? Will it do associative array lookups? In lists? In dictionaries? What are the idioms for end-of-file checks? What does a quoted string look like? ( r"\\n[ ]+" ) !
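For the record, roughly what a few of those answers turn out to look like in python (the file name below is just an example, not anything from the real program):

    import re

    # negation and membership: "not" and "in" are spelled out
    done = False
    if not done:
        print("still going")

    print("spam" in ["spam", "eggs"])          # lookup in a list
    print("title" in {"title": "Moby Dick"})   # lookup in a dictionary

    # end-of-file checks: just iterate the file object; the loop stops at EOF
    with open("catalog.txt") as f:             # example file name
        for line in f:
            print(line.rstrip())

    # raw strings keep backslashes literal, which matters for regexes
    pattern = re.compile(r"\n[ ]+")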

In the first non-trivial program, one crawls through dozens of little puzzles, with code festooned with print statements (well, print str(list) + " text" statements).

For me, the precipitating event was when my kludgy awk program handling the PYX form of a simple XML document got so complicated I couldn't follow my own logic any more, and there was an inconvenient bug.

This isn't the first time I've pushed awk too far. I have on this desktop before me an awk program that:

    # Read file of edge pairs, one pair per line, and list the root and all
    # the children below it in the tree. Edges on the left are lower in
    # the tree.

It's not a long program, and it doesn't matter that the console CPU lights are on solid busy for the better part of a minute. But the code was not clear. I should-a used something else.

Not that it is clear that python has good tree management routines. I haven't checked.

Anyway, my new program works, I understand it, it's fast as hell, and it uses a nice recursive routine. I used to do a lot more recursion back in my Pascal salad days, in the first Reagan administration. It was good to uncork thinking that way about a problem again, instead of keeping state variables in a loop.
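Something in the spirit of that routine, though not the actual program: build a parent-to-children map from the edge pairs, then walk it recursively from the root.

    import sys
    from collections import defaultdict

    # Each input line is an edge pair; the item on the left is lower in
    # the tree, so it is a child of the item on the right.
    children = defaultdict(list)
    lower, upper = set(), set()
    for line in sys.stdin:
        child, parent = line.split()
        children[parent].append(child)
        lower.add(child)
        upper.add(parent)

    def walk(node, depth=0):
        """Print a node, then recurse into everything below it."""
        print("  " * depth + node)
        for kid in children[node]:
            walk(kid, depth + 1)

    # A root is anything that appears on the right but never on the left.
    for root in sorted(upper - lower):
        walk(root)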

Of course, the problem that precipitated this was XML processing. I am extracting data from the Project Gutenberg catalog.rdf. I chose to extract six fields of interest, tab-separated, for some sed/grep/awk postprocessing, then create XML for an iPad app to process.

There are several more approved ways to do this:

One is to read in the whole XML tree and then ask questions, traversing the nodes. This is the DOM method, and it doesn't meet most of my XML needs: the data is too damn big. One test had python running overnight, and up to 2.5GB of main memory, when I aborted it. My Aperture database has >100,000 photos in it, and it's too large. Even with PG's 40,000 books, the first gulp takes way too long.
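The DOM version is only a few lines, which is the seductive part. Something like this, with the element name standing in for whichever field you're after:

    from xml.dom import minidom

    # DOM: the whole document gets parsed into memory before any question is asked.
    doc = minidom.parse("catalog.rdf")    # this is the step that never finished
    for node in doc.getElementsByTagName("dc:title"):
        if node.firstChild is not None:
            print(node.firstChild.data)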

There's a second technique of eating up small chunks, and then processing them. This is about the right approach for large files with many, mostly unrelated entries. Alas, the python implementation uses callbacks to report these values. This forces an obnoxious program structure so common in graphics and app programming these days. Okay, that is overstating it a bit.
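For the curious, the callback style with the standard SAX module looks roughly like this (element name again just illustrative): the parser drives, your logic gets scattered across handler methods, and the state lives in instance variables.

    import xml.sax

    class TitleHandler(xml.sax.ContentHandler):
        """The parser calls these; the handler just accumulates state."""
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.pieces = []

        def startElement(self, name, attrs):
            if name == "dc:title":            # illustrative element name
                self.in_title = True
                self.pieces = []

        def characters(self, content):
            if self.in_title:
                self.pieces.append(content)

        def endElement(self, name):
            if name == "dc:title":
                print("".join(self.pieces))
                self.in_title = False

    xml.sax.parse("catalog.rdf", TitleHandler())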

The approach I still favor is conversion of the XML to PYX (basically, one XML element or thing or token per line), processed by the usual pipeline of Unix suspects. It runs very fast and now, thanks to my shiny new python program, doesn't have a turd of inscrutable logic in the middle.
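In PYX, each line starts with a one-character code: "(" opens an element, ")" closes one, "A" carries an attribute, "-" is character data. A stripped-down sketch of the kind of filter that sits in the middle of the pipeline (the field name is, once more, just an example):

    import sys

    # Walk the PYX stream line by line, remembering only the current element.
    current = None
    for line in sys.stdin:
        code, rest = line[0], line[1:].rstrip("\n")
        if code == "(":
            current = rest
        elif code == ")":
            current = None
        elif code == "-" and current == "dc:title":   # example field
            print(rest)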

I think I will force myself to change some of the other bits to python.