Programming :: Python - Parsing Data In Chunks / Sections?
Nov 5, 2010
As a curious side project I'm playing with mzXML data (an XML format for holding mass spec data). A typical scan can be quite large, even up into GB size. I'm wondering how one would go about parsing an XML file in sections, one section at a time, the idea being that if the computer doesn't have enough memory to load the entire data file, it can work on chunks of it at a time.
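One common answer is ElementTree's iterparse, which streams the file and hands back each element as its closing tag arrives, so the whole document never has to fit in memory. A minimal sketch - the file name and the attribute names are illustrative, not taken from the post:
Code:
import xml.etree.ElementTree as ET

def iter_scans(path):
    # stream the file: iterparse yields (event, element) pairs as parsing goes
    for event, elem in ET.iterparse(path, events=('end',)):
        if elem.tag.endswith('scan'):   # mzXML tags carry a namespace prefix
            yield elem
            elem.clear()                # free this chunk's memory before moving on

# 'run01.mzXML' is a made-up file name
for scan in iter_scans('run01.mzXML'):
    print(scan.get('num'), scan.get('peaksCount'))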
My C foo is terrible! I am working with some code which reads lines from a file, then reformats the lines and writes them to a new file. The input lines look like this:
I've been loosely following this: http://norvig.com/lispy.html. And I have a problem: the parsing function throws an array-out-of-bounds exception. I thought that maybe I was doing it wrong, so I copied and pasted the code from the page, and still got the same error.
I have a hard drive with a bad PCB. It stays on when not under heavy load, and it will restart if I copy too much data off it. So far I have had good luck with folders under 500 MB in size: I copy one folder to my good hard drive, wait five minutes, copy another, and so on.
If I mount the bad drive and try to copy a folder several GB in size, it will start and then stop as the hard drive restarts. When I try to mount the drive again, Linux says it can't read the superblock. I have over 30 GB of data spread across many different folders.
What I am looking for is a way of copying a folder in Linux such that the command grabs the whole folder in chunks, with a timed break in between.
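A sketch of one way to do that in Python, mirroring the sizes that worked by hand above (the 500 MB chunk and five-minute pause are taken from the post; everything else is generic):
Code:
import os
import shutil
import time

def copy_in_chunks(src_root, dst_root, chunk_bytes=500 * 1024 * 1024, pause=300):
    copied = 0
    for dirpath, dirnames, filenames in os.walk(src_root):
        target_dir = os.path.join(dst_root, os.path.relpath(dirpath, src_root))
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        for name in filenames:
            src = os.path.join(dirpath, name)
            shutil.copy2(src, os.path.join(target_dir, name))
            copied += os.path.getsize(src)
            if copied >= chunk_bytes:
                time.sleep(pause)   # timed break so the drive can settle
                copied = 0

rsync's --bwlimit option is another way to keep the drive from being hammered, though it throttles throughput rather than giving hard pauses.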
I want to copy about 40GB to a partition. There are two hard drives in my box; one won't boot, but I can access it and mount its partitions, and I aim to move data from it to a new bootable hard drive. A simple cp command may not be the best way to copy and paste such a large chunk? I also want to back up the data I plan to copy/paste, using a USB hard drive for the backup. But I could also paste the data from the backup to the new drive instead of from the old internal hd to the new hd - that's another option.
I have man properly installed on my boxes (I have Ubuntu and Debian; I would also be interested in how to do this for Fedora), however only section 1 of the man pages is available. How do I install additional sections? I am particularly in need of sections 2 and 3. Command-line instructions needed - I don't have a desktop environment on all my boxes.
I am developing a program on a system where Linux does not take care of the sync command automatically, so I have to run it from my application every time I save some data to the disk, which in my case is a 2GB SD card. It is true that I can make the operating system take care of the synchronization, using the proper mount option, but in that case the program's performance drops drastically. In particular I use the shelve module from Python to save data that comes in over a socket/TCP connection, and I have to deal with the potential risk of the system being turned off suddenly. Initially I wrote something like this to save data using shelve:
But that takes too much time to save the data. Note that I run the OS's sync every time I close a file, to prevent data corruption in case the "computer" is turned off with data still in the buffer. To improve the performance I did something like this:
Code:
import os
import shelve

def saveListData(list):
    fd = shelve.open('file_name', 'c')
    for itemVo in list:
        fd[itemVo.key] = itemVo
    fd.close()
    os.system("sync")   # sync once per batch instead of per object
So first I collect a number of objects in a list, then I open the file and save the objects; this way I only have to open the file once to save a lot of objects. However, I would like to know whether adding a lot of objects before closing the file increases the risk of data corruption. I know that turning off the system after fd.close() and before os.sync may cause problems. But what about turning off the system after
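Either way, the exposure window is between fd.close() and the sync completing; batching makes that window less frequent, but each one covers more records. One way to narrow it is to fsync just the shelf's backing file instead of syncing every filesystem - a sketch, where the 'file_name.db' on-disk name depends on which dbm backend shelve picked, so treat it as an assumption:
Code:
import os
import shelve

def saveListData(items, name='file_name'):
    fd = shelve.open(name, 'c')
    try:
        for itemVo in items:        # objects with a .key attribute, as above
            fd[itemVo.key] = itemVo
    finally:
        fd.close()                  # flush shelve's buffers to the OS
    # push only this file's dirty pages to the card, instead of a global sync;
    # the '.db' suffix is an assumption about the dbm backend
    f = os.open(name + '.db', os.O_RDWR)
    try:
        os.fsync(f)
    finally:
        os.close(f)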
This is my file, named xxx.c, and there are two functions in it.
[code]...
When I compile the whole project, I use -ffunction-sections -fdata-sections to generate the .o files and -Wl,--gc-sections -Wl,--print-gc-sections to link the executable (gcc 4.3.2, ld 2.18, Debian 5). After the first compile I got a list of removed functions. But a second later I remembered that I had forgotten to turn off the Debug compile switch, so I recompiled and got another list. Comparing the two lists, I found that bxx isn't on the second one. At first I guessed that bxx was a debug function that is never used in debug mode and wouldn't be compiled in release mode, but I checked the source and found there is no compile switch around bxx. Its caller function axx, however, is removed whether the debug switch is on or off. I compiled the project several times, but the result is the same. I can't understand it - why? Is it that --gc-sections won't remove all the unused functions?
How would I make a site like this one, LinuxQuestions itself? It puts a thin line around each post to demarcate it, and for the website I'm building I need exactly this functionality. Do I have to use the "gd" library?
I have a system setup script for my Slackware installations that pulls all packages and source files from another machine and sets everything up to be identical between machines. The script works as expected, but I want to make it entirely unattended. How do I make the bash script automatically select "Yes" when prompted by a make file, for example at "Install x(Yes/No): "?
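Piping yes(1) into the command is the usual trick when every answer is the same. If the prompts need inspecting, here is a Python sketch using the third-party pexpect module - the prompt text is taken from the example above, and the script name is made up:
Code:
import pexpect  # third-party module: pip install pexpect

def run_unattended(command):
    child = pexpect.spawn(command)
    while True:
        # answer every Yes/No prompt until the command exits
        index = child.expect([r'\(Yes/No\): ', pexpect.EOF])
        if index == 1:
            break
        child.sendline('Yes')

run_unattended('sh setup.sh')   # hypothetical setup script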
We have a system called Skynet, which is basically a bunch of monitoring tools, including Nagios. What I want to do is output the status of 'critical' processes in conky. The conky part I'll worry about later (how hard can that be?), but I'm looking for some feedback on how I'm parsing the initial data. I figure that the simplest way to get the information is to query the cgi, then take what I need from the results...
All I basically want is the server name and the process name; the above example gives server0/server1 as the servers and 'update status' as the service. How would you go about extracting just these two pieces of information, bearing in mind that the server name and process are variable?
I am looking for ways to parse a web page, say URL..., and extract data from it. URL.... Now, as can be seen, the page has a lot of information. I just need/want to take only the names of the packages rather than the version numbers.
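Since the page layout didn't survive in the post, this is only a rough sketch: fetch the page and keep the name part of entries that look like 'name-1.2.3'. Both the fetching and the pattern are assumptions to adapt to the real page:
Code:
import re
import urllib2   # Python 2; on Python 3 use urllib.request instead

def package_names(url):
    html = urllib2.urlopen(url).read()
    # assumes entries look like 'name-1.2.3'; keep the name, drop the version
    return sorted(set(re.findall(r'\b([A-Za-z][\w+.-]*?)-\d[\d.]*', html)))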
I am trying to get a script I'm calling to populate rows with information from a SQL database, but I'm not getting the data to output correctly into the rows. Can someone please help?
I'm trying to figure out how to display parts of a .db file created by the scorch2000 server: a player's name, games played, score and maybe more...
I don't want to display everything, of course ^^, but how do I get the player name, the number of games he played and his score to display on a web page in this fashion:
Name        Games Played    Score
joe blow1   25              9876890
joe blow2   31              8989767
joe blow2   26              7989767
joe blow2   17              5989767
joe blow2   13              4989767
and sorted by highest score, because the log doesn't put them in score order....
Please help. I asked the maker, because he has one running already, but got no answer back; well, the game is pretty old, so I didn't really expect an answer anyway. I tried to figure it out myself, but I don't know the functions in PHP. This is to include in a php-nuke block (that part I know how to do).
Here is an example of a working page on the developer's website: url
I have a function definition in a Python 2.x script which takes a tuple as one of its arguments, but neither 2to3 nor any of my searching has an answer for how to represent the same thing in Python 3.x.
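Python 3 removed tuple parameters from function definitions (PEP 3113); the standard translation is to accept a single argument and unpack it on the first line of the body. The function name here is just an example:
Code:
# Python 2 allowed:  def distance((x, y)): ...
# PEP 3113 removed that, so in Python 3 take one argument and unpack it:

def distance(point):
    x, y = point
    return (x ** 2 + y ** 2) ** 0.5

print(distance((3, 4)))   # 5.0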
I have log files from freeradius that look as follows:
$ grep "Login incorrect (rlm_ldap: User not found" /var/log/radius/radiusd-inner-tunnel-20090831.log Mon Aug 31 09:25:27 2009 : Auth: Login incorrect (rlm_ldap: User not found): [John Doe] (from client oficina port 0 via TLS tunnel)
[code]....
I use the following line to get the number of users that don't exist in LDAP:
Code:
grep "Login incorrect (rlm_ldap: User not found" /var/log/radius/radiusd-inner-tunnel-20090831.log | awk '{print $14}' | sort -fu | wc -l
Now awk, on line one for example, parses [John Doe] and [Joon Williams] as "[John", and that's not what I want. How can I make awk treat the username field as everything enclosed between the square brackets?
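In awk you could make the brackets themselves the field separator, but since this thread is Python-flavoured, here is the same count done with a regex that grabs everything between the square brackets (the path and match string are the ones from the grep above):
Code:
import re

users = set()
with open('/var/log/radius/radiusd-inner-tunnel-20090831.log') as log:
    for line in log:
        if 'Login incorrect (rlm_ldap: User not found' not in line:
            continue
        m = re.search(r'\[([^\]]*)\]', line)   # the whole bracketed username
        if m:
            users.add(m.group(1).lower())      # fold case, like sort -fu
print(len(users))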
Is there a Linux system call that can be used to get the group name from the GID returned by stat()? I realize that I could parse /etc/group (if my user had sufficient permissions).
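There's no dedicated system call for it; in C the lookup is the getgrgid(3) library routine, which does the /etc/group (or NSS) lookup for you. Python exposes the same thing through the grp module:
Code:
import grp
import os

st = os.stat('/etc/passwd')             # any file will do
print(grp.getgrgid(st.st_gid).gr_name)  # e.g. 'root'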
What I am after is to get the string text from the clip tags, but for now I just tested to see if it can find the command tags and print something if it does. But it doesn't find them. Does anyone know why?
It looks like the XML is not good; I tested it with an XML validator:
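That would explain it: ElementTree raises an exception on malformed XML rather than silently finding nothing, so the file needs fixing first. Once it parses, pulling the text out of the clip tags is short - a sketch, with the file name made up:
Code:
import xml.etree.ElementTree as ET

tree = ET.parse('clips.xml')      # hypothetical file name; raises on bad XML
for clip in tree.iter('clip'):    # <clip> tags, as mentioned above
    print(clip.text)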
I have a log file (test.log) starting and ending with a dash line (--), as below. I am looking to write a parser for test.log. This test.log file currently has a single value for one Job ID, but I wish to parse N repeated values for different Job IDs - Job, User, Queue, Dispatched Date, Dispatched Time, Completed Date, Completed Time, Hosts/Processor, CPU_T and TURNAROUND. I can either output these 10 values to another .log file or dump them into cgi.
The parameters selected from test.log for parsing, with the above 10 attributes, are -
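Since the sample log didn't survive in the post, this is only a guess at the layout: a sketch that splits the file on the dashed separator lines and pulls the ten fields out of each record, assuming they appear as '<name> : <value>' pairs. Adjust the per-field regex to the real format:
Code:
import re

WANTED = ('Job', 'User', 'Queue', 'Dispatched Date', 'Dispatched Time',
          'Completed Date', 'Completed Time', 'Hosts/Processor',
          'CPU_T', 'TURNAROUND')

def parse_jobs(path):
    with open(path) as f:
        # one record per Job ID, delimited by the dash lines
        records = re.compile(r'^--+\s*$', re.M).split(f.read())
    jobs = []
    for record in records:
        fields = {}
        for name in WANTED:
            # '<name> : <value>' layout is an assumption
            m = re.search(re.escape(name) + r'\s*:\s*(\S+)', record)
            if m:
                fields[name] = m.group(1)
        if fields:
            jobs.append(fields)
    return jobs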
I have a variable in which the data is stored as below:
variable_test=0m0.001s 0m0.001s 0m0.001s 0m0.001s 0m0.001s 0m0.001s ...and so on.
There are lots of values in a format like "3m1.057s" stored in variable_test, separated by a space between two such values. For example, for the value "3m1.057s" I need to save the different parts of the value in three separate array variables, such as:
var_hour=3 var_min=1 var_sec=057
Can you tell me if this can be done using "awk"? A "while" loop might be used to separate and store these values, I guess?
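awk can split on 'm', '.' and 's' via its field separator, but in the spirit of the thread here is a Python version that fills the three arrays, keeping the post's variable names (even though the fields are really minutes, seconds and milliseconds):
Code:
import re

variable_test = "0m0.001s 3m1.057s 0m0.001s"   # sample data from the post

var_hour, var_min, var_sec = [], [], []
for value in variable_test.split():
    m = re.match(r'(\d+)m(\d+)\.(\d+)s$', value)
    if m:
        var_hour.append(m.group(1))   # '3'
        var_min.append(m.group(2))    # '1'
        var_sec.append(m.group(3))    # '057'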
I want to know how to get, e.g., the contents of a form on a web page that has been passed to a server-side PHP script, inside, for example, an array which I can read. I've been reading an ebook on PHP which, as far as I can see, doesn't cover this.
I'm trying to recreate a simple script I wrote to parse access.log to get a rough idea of the websites users are going to on our corp network. The issue I'm having is that I want to pull out any line from access.log that ends in .com/, .org/, .net/ or whatever, to see only what the user entered into the address bar, and drop pictures, js files and everything else, and log only that. So what I do is: awk '{print $8} | grep -e '[cong]|[ore]|[mgtv][/]'$ and nothing happens. I know there is an easier way to do this with awk alone,
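One thing to check: the awk string in that pipeline is never closed, so the shell swallows the rest of the line. As an alternative to awk and grep entirely, a Python sketch that keeps only the site-root requests, assuming the URL really is the 8th whitespace-separated field as in the awk above:
Code:
import re

# a bare site root like http://example.com/ with nothing after the TLD
pattern = re.compile(r'\.(com|org|net)/?$')

with open('access.log') as log:            # path is illustrative
    for line in log:
        fields = line.split()
        if len(fields) >= 8 and pattern.search(fields[7]):   # awk's $8 -> index 7
            print(fields[7])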