Programming :: How To Extract A Subset From A Huge Dataset
Mar 13, 2010
I have a huge file, about 450 GB in size. Its format is as follows:
x1 50020 A 1
x1 50021 B 8
x1 50022 C 9
[code]....
Now I want to extract a subset of this file in which column 1 is x10 and column 2 ranges from 600000 to 30000000. I wrote the following Perl script, but it doesn't work:
#!/usr/bin/perl
$file1 = $ARGV[0]; # Input file
$file2 = $ARGV[1]; # Output file
[code]...
I suspect the input and output files are both too big for my script to handle.
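A file of this size does not have to fit in memory if it is read one line at a time. The following is a minimal Python sketch of that idea, not the poster's script; it assumes whitespace-separated columns and an inclusive range, and the function name `extract_subset` is made up for illustration.

```python
import io

def extract_subset(lines, key="x10", low=600000, high=30000000):
    """Yield lines whose column 1 equals key and whose column 2 is in [low, high]."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != key:
            continue
        try:
            position = int(fields[1])
        except ValueError:
            continue  # skip lines whose second column is not an integer
        if low <= position <= high:
            yield line

# Small in-memory stand-in for the real 450 GB file (hypothetical rows).
sample = io.StringIO(
    "x1 50020 A 1\n"
    "x10 599999 B 8\n"
    "x10 600000 C 9\n"
    "x10 30000001 A 2\n"
)
for kept in extract_subset(sample):
    print(kept, end="")
```

With a real file, the same generator can be fed an open file object and its results passed to `writelines()` on the output file, so only one line is held in memory at a time.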
Aug 27, 2010
This is my first time on this forum. I am a statistician. I am trying to subset a large dataset by specifying the start and end lines. The dataset is pretty large (more than 300 million lines), with around 1.2 million lines per person, so I would like to split it into consecutive per-person chunks. I tried wrapping R code around this, but R seems to have to read from the top of the file down to the lines I want, even though I specified that it should skip the lines that other tasks have already read. As a result, memory use grows with the task ID, and eventually the administrator kicked me off the system.
I guess a shell tool could do this much more simply and elegantly. I first thought of the "split" command, but the file has a 10-line header, so I can't split it into even-sized chunks.
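One way to avoid re-reading the file for every task is to make a single pass: skip the header once, then cut consecutive fixed-size chunks. The sketch below is in Python and assumes every chunk has the same number of lines, which the post only states approximately; `split_after_header` is a hypothetical name.

```python
import io
from itertools import islice

def split_after_header(lines, chunk_size, header_lines=10):
    """Skip the header once, then yield consecutive lists of up to chunk_size lines."""
    iterator = iter(lines)
    for _ in islice(iterator, header_lines):
        pass  # discard header lines; they are read exactly once
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        yield chunk

# Tiny stand-in: 10 header lines followed by 5 data rows.
demo = io.StringIO(
    "".join(f"header{i}\n" for i in range(10))
    + "".join(f"row{i}\n" for i in range(5))
)
for number, chunk in enumerate(split_after_header(demo, chunk_size=2)):
    print(number, [row.strip() for row in chunk])
```

Each chunk could be written to its own file inside the loop; memory use is then bounded by one chunk rather than growing with the position in the file.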
Sep 30, 2010
I need to figure out how to arrange for the fastest-possible read-access of a large or huge memory-mapped file. I'm writing high-speed real-time object-chasing software for a NASA telescope (on earth). This software must detect images of fast moving objects (across arbitrary fields of fixed stars), estimate what direction and speed the object image is traveling (based on the length and direction of a streak on the detection image), then chase after the object while capturing new 4Kx4K pixel images every 2~5 seconds, quickly matching its speed and trajectory, then continue to track and capture images until the object vanishes (below horizon, into earth shadow, etc).
I have created two star "catalogs". Both contain the same 1+ billion stars (and other objects), but one is a "master catalog" that contains all known information about each object (128 bytes per object == 143GB) while the other is a "nightly build" that only contains the information necessary to perform the real-time process (32 bytes per object == 36GB) with object positions precisely updated for precession and proper-motion each night. Almost always the information in the "nightly build" catalog will be sufficient for the high-speed (real-time) processes.
[Code]...
Aug 21, 2010
I'm writing a user ranking module for a site. The ranking depends on several criteria, and each criterion can be set or unset to control whether it counts toward the user's rank. Here is how I've implemented the ranking calculation:
When I set one or more criteria to be considered in the ranking, I insert one record per criterion for each user in the system. For example, with 2 criteria both set and two users, I'll have:
Ranking table
--------------
username | criteria | to_be_added | score
--------------------------------------------------
user1 | criteria1 | 1 | 0
[code]....
In other words, I just set the to_be_added field to 1 for all of them and defer calculating each user's score for each criterion until that user logs in, to avoid doing all these calculations at once, because there are a huge number of users. But there is one problem: if I want to show, for example, the best user (the one with the highest score), the result can't always be correct, because some users might not have logged in by then and their scores might still be zero.
Jun 18, 2010
I have Cygwin on Windows XP running rsync to a remote Ubuntu server over ssh on an ADSL line. My data set is about 20 GB. rsync under Cygwin will back up incrementally, so after the first backup the process should be relatively quick, but over ADSL the first backup will take too long. I was thinking about doing the first backup by copying the files to an external hard drive, then attaching the drive to my remote server and copying the files across. The idea is that rsync will pick up the files as if it had created them in the first place, and the incremental backups will then carry on from there.
Does anyone have any experience with this, or can anyone provide any advice? The external HD is FAT32, which is okay with Windows and should be okay with Ubuntu? In XP, right-click copy and then paste keeps the file dates intact on the external HD. Is this enough to get rsync going incrementally?
Feb 9, 2010
I have two files: a huge one (200.000+ lines) called 'db' and a big one (15.000+ lines) called 'indices'. What is the quickest way of filtering out the lines in 'db' that contain any index from 'indices' (anywhere on the line)? Is there a faster approach in bash on Linux?
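As a point of comparison, here is a rough Python sketch that compiles all indices into one regular expression and scans 'db' once. It assumes the indices are literal strings (not patterns) and that the matching lines are the ones wanted; inverting the test would drop them instead. The function name is hypothetical.

```python
import re

def lines_matching_any(db_lines, indices):
    """Return the db lines that contain at least one index as a substring."""
    needles = [re.escape(index) for index in indices if index]
    if not needles:
        return []
    # One alternation means each line is scanned once, not once per index.
    pattern = re.compile("|".join(needles))
    return [line for line in db_lines if pattern.search(line)]

# Hypothetical sample data standing in for 'db' and 'indices'.
db = ["alpha 17 x", "beta 99 y", "gamma 42 z"]
indices = ["17", "42"]
print(lines_matching_any(db, indices))
```

Escaping each index keeps characters such as "." or "+" from being interpreted as regex syntax.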
Jan 28, 2011
I cannot find a way to run a command on a subset of the files in a directory. How can I do it?
Dec 27, 2008
I need to extract a price from a string. The price may vary in the future, so it may be 12.99 or 14.99. I thought a sed command might crack it, and I need to write the result to a file. The string is:
<td><b class="priceLarge">?6.99</b>
I need to extract the price 6.99 (without the ?), that is, anything between "> and </b>, and write it to a file such as tmp1.txt.
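The same extraction can be expressed as a single regular expression. This Python sketch assumes the price always sits inside the `priceLarge` tag and may be preceded by one or more non-digit characters such as the "?"; `extract_price` is an illustrative name.

```python
import re

# Skip any non-digit prefix after the tag, then capture the number before </b>.
PRICE = re.compile(r'class="priceLarge">[^0-9<]*([0-9]+(?:\.[0-9]+)?)</b>', re.IGNORECASE)

def extract_price(html):
    """Return the price as a string, or None if the tag is not found."""
    match = PRICE.search(html)
    return match.group(1) if match else None

print(extract_price('<td><b class="priceLarge">?6.99</b>'))
```

Writing the result to tmp1.txt is then an ordinary file write of the returned string.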
Aug 25, 2010
I am trying to extract a web page via Google for processing. I am able to create a proper query and test it using cut/paste into the address bar of my firefox browser.
When I attempt to extract the page with wget:
wget -O - -q "$query"
I do not see the information that is present when I used the browser.
Dec 22, 2010
I am trying to extract 2 numbers from the same file, and my goal is to print them both in another file, on the same line, separated by a space. I have to do that for 20 files, so I would like to end up with 20 lines like this in the output file. It would look like this:
Quote:
number1_file1 number2_file1
number1_file2 number2_file2
...
...
number_1_file20 number2_file20
So far, I did only extract one number and got an output file like this :
Quote:
number1_file1
number1_file2
...
...
number1_file20
And I did this by running a bash script with the following content :
Code:
#!/bin/bash
ls execution$1$2*.* | while read filename
do
cat $filename | grep -e "Total aborts:" | cut -d " " -f3 >> abort$1$2.dat
done
$1 and $2 are just strings that identify the files I want to consider in this loop. This script works well for extracting one number, which is the 3rd field of a line starting with "Total aborts:". Now, how could I change this script to do what I described above (i.e. extract two numbers from two different lines)? The second number is the 3rd field of a line starting with "Total throughput:".
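The core of the task is collecting both fields from one pass over each file and printing them together. A minimal Python sketch of that step follows; it assumes space-separated fields as in the `cut` call, and the sample text is invented.

```python
LABELS = ("Total aborts:", "Total throughput:")

def aborts_and_throughput(text):
    """Return 'aborts throughput' built from the 3rd field of the two labelled lines."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        for label in LABELS:
            if line.startswith(label) and len(fields) > 2:
                values[label] = fields[2]
    # A missing line yields an empty field rather than an error.
    return " ".join(values.get(label, "") for label in LABELS)

# Hypothetical file content.
sample = "run 1\nTotal aborts: 12\nTotal throughput: 345.6 tx/s\n"
print(aborts_and_throughput(sample))
```

Calling this once per file and appending the returned string to the output file would give one line per file.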
Mar 26, 2010
I have this string: ./DAT000728-652523058.job. I want to extract the number between DAT and the - sign. I want 728, not 000728. With
echo ./DAT000728-652523058.job | cut -d'T' -f2 | cut -d'-' -f1
I am getting 000728. The string can also be ./DAT326822-652523058.job, in which case I need 326822.
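Converting the captured digits to an integer is one simple way to drop the leading zeros. A small Python sketch, assuming the name always has the form DAT, digits, then a hyphen (`job_number` is an illustrative name):

```python
import re

def job_number(path):
    """Return the digits between 'DAT' and '-' as an int, or None if absent."""
    match = re.search(r"DAT(\d+)-", path)
    # int() discards leading zeros: "000728" becomes 728.
    return int(match.group(1)) if match else None

print(job_number("./DAT000728-652523058.job"))
print(job_number("./DAT326822-652523058.job"))
```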
Jul 18, 2011
I have a lot of files containing chat-log (IRC) and would like to extract information out of these files.
File sample
Code:
Session Start: Sat Apr 03 15:06:29 2010
Session Ident: XXX
[15:06] XXX is ~X@host-85-85-85-154.isp.be * XXX
[15:06] XXX on #channel1 #channel2 #channel3
[Code]....
Aug 28, 2010
I have many files in a folder from which I need to extract some contents. These are basically text files which have individual lines such as:
name: john
address: whatever
phone: 123456
Some caveats
1. Sometimes a line might be missing.
name: johnn
phone: 123456
2. Lines are not at the same line numbers across the files.
I did try some things with awk based on Google searches, but I couldn't extract the data of each file into a single line (this is the ultimate goal):
john,whatever,123456
I don't have much knowledge beyond having put some bash scripts together for backup jobs, so I am open to installing anything that could pull this off.
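Because lines can be missing or out of order, looking fields up by label rather than by line number sidesteps both caveats. The Python sketch below assumes each field appears as `label: value` on its own line and that values contain no commas; `record_to_csv` is a made-up name.

```python
FIELDS = ("name", "address", "phone")

def record_to_csv(text):
    """Collapse one file's 'label: value' lines into 'name,address,phone'."""
    found = {}
    for line in text.splitlines():
        label, separator, value = line.partition(":")
        label = label.strip().lower()
        if separator and label in FIELDS:
            found[label] = value.strip()
    # Missing fields become empty columns, so every row has three columns.
    return ",".join(found.get(field, "") for field in FIELDS)

print(record_to_csv("name: john\naddress: whatever\nphone: 123456"))
print(record_to_csv("phone: 123456\nname: johnn"))
```

If values can contain commas, Python's `csv` module would be the safer way to write the row.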
Nov 27, 2010
I'm trying to create an application that monitors, among other things, what site the user is currently viewing. I would like to know if there is any way to get the current URL from Firefox's address bar on a Linux machine. I know that under Windows I can use the DDE server approach, but under Linux this task is proving very tricky. I've considered an approach involving a Firefox extension, but this would require the user to install the extension himself, which is not something I want. If an extension can be installed by a different program's installer, then that could work, but I don't know if that's possible or not.
Jul 2, 2009
The idea is to make a website that checks the availability of domains. It works, but it's not pretty yet. Below is what I have so far:
## this is the API from my domain registrar.
<?php $client = new SoapClient('http://api.sync.com/?wsdl');
## I have a search box that sends the request to this page
$var = $_GET ["s"];
## remove the most common subdomains from the request.
$var=eregi_replace("www.", "", $var);
$var=eregi_replace("mail.", "", $var);
$var=eregi_replace("ftp.", "", $var);
$var=eregi_replace("pop.", "", $var);
$var=eregi_replace("smtp.", "", $var);
## remove any TLD extension from the request.
$split = explode(".", $var);
$main = $split[0];
$arraysize = sizeof($split);
for ($x=1; $x<$arraysize; $x++) {
$tld .= "." . $split[$x];
}
## login to the API
$paramLogin = array('handle' => 'randall', 'password' => 'password');
## match the domain with any possible TLD
$varcom = $paramAvailDomain = array('sld' => $main, 'tld' => 'com');
$varnet = $paramAvailDomain = array('sld' => $main, 'tld' => 'net');
$varorg = $paramAvailDomain = array('sld' => $main, 'tld' => 'org');
$varbiz = $paramAvailDomain = array('sld' => $main, 'tld' => 'biz');
$varinfo = $paramAvailDomain = array('sld' => $main, 'tld' => 'info');
$vareu = $paramAvailDomain = array('sld' => $main, 'tld' => 'eu');
$varnl = $paramAvailDomain = array('sld' => $main, 'tld' => 'nl');
$varbe = $paramAvailDomain = array('sld' => $main, 'tld' => 'be');
$varde = $paramAvailDomain = array('sld' => $main, 'tld' => 'de');
$varcouk = $paramAvailDomain = array('sld' => $main, 'tld' => 'co.uk');
$varorguk = $paramAvailDomain = array('sld' => $main, 'tld' => 'org.uk');
$varname = $paramAvailDomain = array('sld' => $main, 'tld' => 'name');
$varmobi = $paramAvailDomain = array('sld' => $main, 'tld' => 'mobi');
$varin = $paramAvailDomain = array('sld' => $main, 'tld' => 'in');
$vartv = $paramAvailDomain = array('sld' => $main, 'tld' => 'tv');
$varcn = $paramAvailDomain = array('sld' => $main, 'tld' => 'cn');
$varws = $paramAvailDomain = array('sld' => $main, 'tld' => 'ws');
$varnu = $paramAvailDomain = array('sld' => $main, 'tld' => 'nu');
$varbz = $paramAvailDomain = array('sld' => $main, 'tld' => 'bz');
$varcc = $paramAvailDomain = array('sld' => $main, 'tld' => 'cc');
## this requests the domain.COM and domain.NET
$varcom;
$varnet;
?>
<div id="content">
## below prints the result
<?php
print "<html><body><pre>";
$result1 = $client->__soapCall('Login', $paramLogin);
echo "<b>Result Login:</b>
" . print_r($result1, true);
$result15 = $client->__soapCall('AvailabilityDomain', $varcom);
$resvarcom = var_dump($result15, true);
$result15 = $client->__soapCall('AvailabilityDomain', $varnet);
$resvarnet = var_dump($result15, true);
print "</pre></html>";
?>
<?php
## the returned array looks like this
Result Login:
Array
(
[code] => 200
[message] => Login succesful
)
array(3) {
["code"]=>
string(3) "200"
["message"]=>
string(20) "Domain not available"
["result"]=>
object(stdClass)#236 (1) {
["status"]=>
string(5) "TAKEN"
}
}
bool(true)
array(3) {
["code"]=>
string(3) "200"
["message"]=>
string(16) "Domain available"
["result"]=>
object(stdClass)#232 (1) {
["status"]=>
string(4) "FREE"
}
}
bool(true)
?>
## till so far it works
What I need to do is turn this ugly-looking reply into something more readable: basically, if the status is TAKEN, print "occupied", and if it is FREE, print "it's yours to grab". I have been struggling with the in_array function, but I'm not getting anywhere close to making it work.
Mar 13, 2009
I am trying to get the metadata out of an image file in Python. I have tried using PIL, but it does not give me the data I am looking for (mostly I just got a bunch of hex code), and I have no idea how to use ImageMagick: the Python module is poorly documented and I can't find any examples on the net. The info I need is stuff like camera model, whether the flash was used, focal length, exposure time, date, etc., pretty much the same info I get when I look at the "Image" tab of the properties dialog in Nautilus on Ubuntu.
What I am doing is writing a script that will iterate through a lot of pictures and put all this metadata into MySQL. I chose Python since it is simple and I am familiar with it, but I can't find a good way to get that metadata from within Python.
Dec 17, 2010
I have a small bash/awk program that extracts the date, time, and size from thousands of email headers. I'm also trying to extract the last "Received from:" string from these headers, which will give me the sender's email server. How can I extract the last occurrence of this string and print the information after it?
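One option outside awk is to let a header parser handle folded (multi-line) headers and then take the last "Received" entry. The sketch below uses Python's standard `email` package; the header text is invented, and treating the bottom-most "Received" header as the sender's server is the post's assumption, not a guarantee.

```python
from email import message_from_string

def last_received(raw_headers):
    """Return the last Received header as one whitespace-normalised line, or None."""
    received = message_from_string(raw_headers).get_all("Received")
    if not received:
        return None
    # Folded headers contain newlines and tabs; collapse them to single spaces.
    return " ".join(received[-1].split())

# Hypothetical headers; example.* hosts are placeholders.
raw = (
    "Received: from mx.example.org (mx.example.org [192.0.2.1])\n"
    "\tby inbox.example.net; Fri, 17 Dec 2010 10:00:02 +0000\n"
    "Received: from sender.example.com\n"
    "\tby mx.example.org; Fri, 17 Dec 2010 10:00:01 +0000\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
print(last_received(raw))
```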
May 25, 2010
I'm trying to find a way to extract the phrase between the words Connection and is (i.e. the underlined words below). Can we use awk to do this? How? Is it the best command to use?
Code:
[06:25:00][i] Connection at Plant A is live
[06:25:00][i] Connection at Building_C is not live
[07:25:00][i] Connection at Terminal D is down
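For comparison with an awk solution, a non-greedy regular expression captures everything between the two words. This Python sketch returns the whole span, including "at", since the underlining in the original post is not preserved; `phrase_between` is an illustrative name.

```python
import re

# Non-greedy capture stops at the first standalone "is" after "Connection".
BETWEEN = re.compile(r"\bConnection\s+(.*?)\s+is\b")

def phrase_between(line):
    """Return the text between 'Connection' and 'is', or None if not present."""
    match = BETWEEN.search(line)
    return match.group(1) if match else None

for line in (
    "[06:25:00][i] Connection at Plant A is live",
    "[06:25:00][i] Connection at Building_C is not live",
    "[07:25:00][i] Connection at Terminal D is down",
):
    print(phrase_between(line))
```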
May 7, 2011
I've spent most of the evening browsing the web, trying many things I've found on various forums, but nothing seems to work.
I have a test.txt file containing many lines like the following ones :
...
<insert_random_text>228.00 €<insert_more_random_text>
<insert_random_text>17.50 €<insert_more_random_text>
<insert_random_text>1238.13 €<insert_more_random_text>
...
And I want to extract :
...
228.00
17.50
1238.13
...
There is always one occurrence of € in each line. I want the numeric value that precedes this € occurrence. The random text (before and after) may contain numbers too, so the € may be important to parse, in order to correctly identify the number to return. The last character that precedes the number to extract is always a ">" (coming from an HTML tag).
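The two anchors described above (a ">" immediately before the number and a "€" after it) translate directly into a regular expression. A Python sketch under those assumptions, with an invented sample line:

```python
import re

# The number must follow a '>' and be followed by optional spaces and '€'.
AMOUNT = re.compile(r">\s*(\d+(?:\.\d+)?)\s*€")

def amount_before_euro(line):
    """Return the numeric string preceding the euro sign, or None."""
    match = AMOUNT.search(line)
    return match.group(1) if match else None

# Hypothetical line with other numbers before and after the price.
print(amount_before_euro("<td>Item 12</td><td>228.00 €</td><td>3 left</td>"))
```

Other numbers on the line are ignored because they are not followed by the euro sign.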
Jul 11, 2011
I have a requirement to extract the full name of a process running on my box. I tried various options of ps. The wide option gave me the full command line, but that contains the command, the interpreter, and also the arguments passed.
Code:
XX XX XX XX XX XX /usr/bin/sh /path/to/exe/myexe.sh arg1 arg2 arg3.
Is there any way, with ps or any other command, to extract the full name of the command?
Desired Output :
Code:
/path/to/exe/myexe.sh or myexe.sh
Jan 30, 2011
A strange question, I guess. I'm running processes called from a C main program. For now the call is performed as: FILE *res = popen("ulimit -t 1; prg args", "r"); so I can read the stdout of the process as a file and analyze it. The time limit is important for me.
2 questions:
1. How do I get to know if the process terminated on its own or by the ulimit?
2. How do I limit to times that are less than 1 sec (I have many of those).
I know that setrlimit exists, just before I change my whole approach I wanted to see if I can deal with these things from the outside.
Oct 24, 2010
I have a file which has the output as shown below:
Code:
Teams | matches |Goals | YC | RC
------------------------------------------------------------------------------
Liverpool: | | | |
Gerrard | 97 | 100 | 41665 | 1342
[code]....
I need to extract the info from the RC column for the first 4 players of Liverpool. The test code I have does this, but can anyone show me a better way of doing it? I could do it easily with gawk -F"|" and print the respective column, but I need to do this in Perl.
Code:
#!/usr/bin/perl
use strict;
use warnings;
[code]....
Aug 3, 2010
I am trying to write commands that extract the height and width of a video file via ffmpeg. I have the following working so far:
This gives the following answer in widthxheight format, with an extra trailing comma: 720x480,
How can I instead run two separate commands that give me the height and width separately? I want one command to give me 720 and another to give me 480, without the x or the comma.
In case it helps, this is what ffmpeg -i videofile.mov 2>&1 gives as output:
Seems stream 0 codec frame rate differs from container frame rate:
At least one output file must be specified
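Once the "720x480," token has been isolated, splitting it into two numbers is a small parsing step. The Python sketch below assumes its input is that already-extracted token rather than the full ffmpeg output; `parse_dimensions` is a hypothetical helper.

```python
import re

def parse_dimensions(token):
    """Return (width, height) as ints from a 'WIDTHxHEIGHT' token, or None."""
    # Two to five digits on each side avoids matching hex-like values such as 0x1f.
    match = re.search(r"\b(\d{2,5})x(\d{2,5})\b", token)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

width, height = parse_dimensions("720x480,")
print(width)
print(height)
```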
Aug 10, 2010
I am trying to extract the latitude and longitude from the command line with this command: Code: curl -s [URL]
Mar 10, 2011
I'm trying to extract specific lines from a flat file. I need the lines that fall within a range of coordinates. The field separator (-F) can be either ! or =. If a line is within the set range, I need all of the data on that line. The ranges are latitude 36 to 39 and longitude -74 to -84.
awk -F '=' '{lat=substr($2,1,2); lon=substr($2,10,3); (lat >36 && lat <39) && (lon >-74 && lon <-84); print lat"--"lon}' < net.log
example line from the flat file
K4MQF-3>APN383,VA2-2,qAR,N3HF-5:!3818.65NS07800.17W#PHG77306/W3,VA3/Clarke Mnt
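Two things stand out in the example line: the position follows "!" (or "="), and the longitude carries a W suffix instead of a minus sign, so it has to be negated before comparing against -74 and -84. The Python sketch below assumes the position is encoded as ddmm.mm latitude, one symbol character, then dddmm.mm longitude, which matches the sample line but is an assumption about the rest of the file.

```python
import re

# '!' or '=', latitude ddmm.mm + N/S, one symbol character, longitude dddmm.mm + E/W.
POSITION = re.compile(r"[!=](\d{2})(\d{2}\.\d{2})([NS]).(\d{3})(\d{2}\.\d{2})([EW])")

def parse_position(line):
    """Return (lat, lon) in signed decimal degrees, or None if no position is found."""
    match = POSITION.search(line)
    if match is None:
        return None
    lat = int(match.group(1)) + float(match.group(2)) / 60.0
    lon = int(match.group(4)) + float(match.group(5)) / 60.0
    if match.group(3) == "S":
        lat = -lat
    if match.group(6) == "W":
        lon = -lon
    return lat, lon

def in_box(line, lat_min=36.0, lat_max=39.0, lon_min=-84.0, lon_max=-74.0):
    """True if the line's position lies inside the latitude/longitude box."""
    position = parse_position(line)
    if position is None:
        return False
    lat, lon = position
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

sample = "K4MQF-3>APN383,VA2-2,qAR,N3HF-5:!3818.65NS07800.17W#PHG77306/W3,VA3/Clarke Mnt"
print(parse_position(sample), in_box(sample))
```

Printing the whole line whenever `in_box` returns True keeps all of the data on matching lines, as the post asks.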
May 10, 2011
I would like to extract debug information, but I have some problems. For example, I have an executable, a.out...
Quote:
nm -f sysv a.out | grep ".global_var" >vars.txt
With this command I extract all my variables. All of them are in the .global_var section, and it gives me the following information:
Quote:
CAN_station_n |08073258| D | OBJECT|00000001| |.global_var
CONTROLend |080732a7| D | OBJECT|00000001| |.global_var
[code]....
So I have only the addresses of my variables, but I would also like to know the type (or struct) of each variable. With a DWARF dump I get all of the information, but it is a mess...
Quote:
<1><117bc>: Abbrev Number: 32 (DW_TAG_variable)
<117bd> DW_AT_name : (indirect string, offset: 0x153d): draw_limits
<117c1> DW_AT_decl_file : 128
<117c2> DW_AT_decl_line : 207
[code]...
Is there any parser, or any other way, to put this information in order? I would like to create a file with the following information:
name of var - address - type - size - struct or not
Mar 22, 2011
I'm trying to extract the href of a <link> tag from an HTML page. However, because some links contain further preferences, I seem to be unable to extract them. Do you have any idea how I can write this? Link:
[Code]...
Nov 19, 2010
How do I use a Perl regex to extract the hostname from an FQDN?
I have
Quote:
$host=ganymede.a.linux.com
$host=io.a.linux.com
$host=europa.a.linux.com
I just want the characters to the left of the first . (dot) in the FQDN. I could get it using the substr and split functions, but how do I get it with a regex?
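The pattern itself is short: capture one or more non-dot characters from the start of the string. It is shown here in Python as a sketch; the same `^([^.]+)` pattern is standard regex syntax that Perl also accepts.

```python
import re

def short_hostname(fqdn):
    """Return everything before the first dot (the whole string if there is no dot)."""
    match = re.match(r"^([^.]+)", fqdn)
    return match.group(1) if match else None

for host in ("ganymede.a.linux.com", "io.a.linux.com", "europa.a.linux.com"):
    print(short_hostname(host))
```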
Mar 6, 2009
I have two files containing lists of packages, each generated using the
Code:
dpkg --get-selections > file-name
command.
package-a.txt
[code]......
Now I would like to create a third file which contains only those packages which are present in package-a.txt but NOT in package-b.txt. The file should look like this:
Code:
package2
package4
Note: The word "install" is also to be removed for all packages. Using the diff command I could get something like this:
Code:
temp# diff -yb package-a.txt package-b.txt | grep "<"
package2 install <
package4 install <
<
temp#
But I am not sure how to remove the instances of "install" and "<".
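An alternative to post-processing diff output is to treat the two lists as sets of package names and subtract one from the other. This Python sketch assumes each line has the form `name install`; the file contents shown are invented, since the originals are elided in the post.

```python
def installed(text):
    """Return package names from 'name install' lines, in file order."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "install":
            names.append(fields[0])
    return names

def only_in_first(a_text, b_text):
    """Return names present in the first list but not in the second."""
    in_b = set(installed(b_text))
    return [name for name in installed(a_text) if name not in in_b]

# Hypothetical contents for package-a.txt and package-b.txt.
package_a = "package1 install\npackage2 install\npackage3 install\npackage4 install\n"
package_b = "package1 install\npackage3 install\n"
print("\n".join(only_in_first(package_a, package_b)))
```

Taking only the first field drops the word "install" as a side effect, so no further clean-up is needed.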
Dec 17, 2010
What would be the best way to extract data by sending queries to a website?