Programming :: How To Extract A Subset From A Huge Dataset

Mar 13, 2010

I have a huge file, about 450 GB in size. Its format is as follows:

x1 50020 A 1
x1 50021 B 8
x1 50022 C 9

[code]....

Now I want to extract a subset from this file: the rows where column 1 is x10 and column 2 is between 600000 and 30000000. I wrote the following Perl script, but it doesn't work:

#!/usr/bin/perl
$file1 = $ARGV[0]; # Input file
$file2 = $ARGV[1]; # Output file

[code]...

I suspect the input and output files are both too big for my script to handle.
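One way to keep memory use flat on a file of this size is to stream it line by line and never hold more than one row at a time. The following is only a minimal sketch in Python, not the original Perl script (which is elided above). The function names `extract_subset` and `copy_subset` are hypothetical, and the sketch assumes whitespace-separated columns as in the sample rows.

```python
def extract_subset(lines, key="x10", low=600000, high=30000000):
    """Yield lines whose column 1 equals key and whose column 2 is in [low, high]."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] != key:
            continue
        try:
            position = int(fields[1])
        except ValueError:
            continue  # skip malformed rows instead of aborting a long run
        if low <= position <= high:
            yield line


def copy_subset(in_path, out_path):
    """Stream in_path to out_path, keeping only the matching rows."""
    with open(in_path) as src, open(out_path, "w") as dst:
        # A file object is iterated lazily, so only one line is in memory at a time.
        dst.writelines(extract_subset(src))
```

Calling `copy_subset("input.txt", "output.txt")` (both file names are placeholders) would then write the subset without loading the whole file.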

View 11 Replies



General :: Subset A Large Dataset By Specifying The Starting & End Line?

Aug 27, 2010

This is my first time on this forum. I am a statistician. I am trying to subset a large dataset by specifying the starting and end lines. The dataset is pretty large (more than 300 million lines), containing around 1.2 million lines per person, so I would like to split the dataset into consecutive per-person pieces. I tried wrapping R code, but R seems to have to read from the top of the file down to the lines I want, even though I specified that it should skip the lines that other tasks have already read. So memory use increases with the task ID. Finally I got kicked out by the administrator.

I guess that the shell may do it much more simply and elegantly. First I thought of the "split" command, but the file has a header of 10 lines, so I can't split it into even-sized chunks.
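As a rough illustration of selecting a line range without holding the skipped lines in memory, here is a small Python sketch. The function name `slice_lines` is hypothetical; the 10-line header comes from the post, and line numbers are assumed to be 1-based and counted after the header.

```python
from itertools import islice


def slice_lines(lines, start, end, header_lines=10):
    """Return data lines start..end (1-based, inclusive), counted after the header."""
    first = header_lines + start - 1  # zero-based index of the first wanted line
    last = header_lines + end         # islice stops before this index
    return islice(lines, first, last)
```

The skipped lines are still read sequentially, since a text file has no index of line offsets, but they are discarded as they go by, so memory stays constant regardless of how far into the file the range starts.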

View 5 Replies View Related

Programming :: Efficient Access Of Huge Files Or Defrag Ext4?

Sep 30, 2010

I need to figure out how to arrange for the fastest-possible read-access of a large or huge memory-mapped file. I'm writing high-speed real-time object-chasing software for a NASA telescope (on earth). This software must detect images of fast moving objects (across arbitrary fields of fixed stars), estimate what direction and speed the object image is traveling (based on the length and direction of a streak on the detection image), then chase after the object while capturing new 4Kx4K pixel images every 2~5 seconds, quickly matching its speed and trajectory, then continue to track and capture images until the object vanishes (below horizon, into earth shadow, etc).

I have created two star "catalogs". Both contain the same 1+ billion stars (and other objects), but one is a "master catalog" that contains all known information about each object (128 bytes per object == 143GB) while the other is a "nightly build" that only contains the information necessary to perform the real-time process (32 bytes per object == 36GB) with object positions precisely updated for precession and proper-motion each night. Almost always the information in the "nightly build" catalog will be sufficient for the high-speed (real-time) processes.

[Code]...

View 8 Replies View Related

Programming :: Implement User Ranking In Php With A Huge Number Of Users?

Aug 21, 2010

I'm writing a user ranking module for a site. The ranking depends on several criteria, and each criterion can be set or unset to decide whether it is considered when calculating the user rank. Here is how I've implemented the ranking calculation:

When I set one or more of the criteria to be considered in the ranking, I insert one record per criterion for each user in the system. For example, if I have two criteria, both are set, and I have two users, I'll have:

Ranking table
--------------
username | criteria | to_be_added | score
--------------------------------------------------
user1 | criteria1 | 1 | 0

[code]....

This means I just set the to_be_added field to 1 for all of them and defer calculating the score for each criterion and each user until that user logs in, to avoid doing all these calculations at once, because there is a huge number of users. But there is one problem: if I want to show, for example, the best user (based on the highest score), the result can't always be correct, because some users might not have logged in by that time and their score might still be zero.

View 1 Replies View Related

Ubuntu :: Rsync Backup To Remote Server Of Large Dataset

Jun 18, 2010

I have Cygwin on Windows XP running rsync to a remote Ubuntu server over ssh using ADSL. My data set is about 20 GB, but rsync backs up incrementally, so after the first backup the process should be relatively quick. Over ADSL the first backup will take too long. I was thinking about doing the first backup by copying the files to an external hard drive, then attaching the hard drive to my remote server and copying the files there. The idea is that rsync will pick up the files as if it had created them in the first instance. The incremental backups will then pick up from there.

Does anyone have any experience with this, or can anyone provide any advice? The external hard drive is FAT32, which is okay with Windows and should be okay with Ubuntu. From XP, right-click copy and paste keeps the file dates intact on the external hard drive. Is this enough to get rsync going incrementally?

View 1 Replies View Related

General :: Most Efficient Way Of Taking Subset Of Lines

Feb 9, 2010

I have two files: one huge one (200.000+ lines) called 'db' and one big one (15.000+ lines) called 'indices'. What is the quickest way of filtering out the lines in 'db' that contain any index (anywhere on the line) from 'indices'? Is there a faster approach in bash on Linux?
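A possible approach, sketched here in Python rather than bash, is to compile all indices into a single pattern so each line of 'db' is scanned once. The names `build_matcher` and `filter_lines` are hypothetical. Since "filtering out" could mean either keeping or dropping the matching lines, the sketch takes a `keep_matches` flag.

```python
import re


def build_matcher(indices):
    """Compile one regex that matches any of the given index strings literally."""
    cleaned = [s.strip() for s in indices if s.strip()]
    if not cleaned:
        return re.compile(r"(?!)")  # a pattern that never matches
    # Escape each index so characters such as "." are treated literally.
    return re.compile("|".join(re.escape(s) for s in cleaned))


def filter_lines(db_lines, indices, keep_matches=True):
    """Yield db lines that contain (or, with keep_matches=False, lack) any index."""
    matcher = build_matcher(indices)
    for line in db_lines:
        if bool(matcher.search(line)) == keep_matches:
            yield line
```

On the shell side, `grep -F -f indices db` does the same fixed-string matching, and adding `-v` inverts the selection.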

View 1 Replies View Related

Ubuntu :: Shell - Run Command For Subset Of Files In Directory?

Jan 28, 2011

I cannot find a way to run a command on a subset of the files in a directory. How can I do it?

View 3 Replies View Related

Programming :: Extract Value From String?

Dec 27, 2008

I need to extract a price from a string. The price may vary in the future, so it may be 12.99 or 14.99. I thought a sed command might crack it, and I need to write the result to a file. The string is:

<td><b class="priceLarge">?6.99</b>

I need to extract the price 6.99 (with no ?), so extract anything between "> and </b> and write it to a file such as tmp1.txt.
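For illustration only, the same extraction can be sketched in Python with a regular expression instead of sed. The names `PRICE_RE` and `extract_price` are hypothetical, and the pattern assumes the price always has two decimal places and sits directly inside the priceLarge tag.

```python
import re

# Skip any non-digit prefix (such as "?" or a currency sign), then capture the number.
PRICE_RE = re.compile(
    r'class="priceLarge">[^0-9<]*([0-9]+\.[0-9]{2})</b>', re.IGNORECASE
)


def extract_price(html):
    """Return the price as a string, or None if the tag is not found."""
    match = PRICE_RE.search(html)
    return match.group(1) if match else None
```

The result could then be written out with `open("tmp1.txt", "w").write(price + "\n")` once `extract_price` has returned a value.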

View 1 Replies View Related

Programming :: Cannot Extract Webpage For Processing

Aug 25, 2010

I am trying to extract a web page via Google for processing. I am able to create a proper query and test it using cut/paste into the address bar of my firefox browser.

When I attempt to extract the page with wget:
wget -O - -q "$query"
I do not see the information that is present when I used the browser.

View 2 Replies View Related

Programming :: Extract 2 Numbers From A Same File?

Dec 22, 2010

I am trying to extract 2 numbers from a same file and my goal is to print them both in another file, on the same line, separated with a space. I have to do that for 20 files and I would like to have therefore 20 lines like this in the output file. It would look like this :

Quote:

number1_file1 number2_file1
number1_file2 number2_file2
...
...
number1_file20 number2_file20

So far, I did only extract one number and got an output file like this :

Quote:

number1_file1
number1_file2
...
...
number1_file20

And I did this by running a bash script with the following content :

Code:

#!/bin/bash
ls execution$1$2*.* | while read filename
do
cat $filename | grep -e "Total aborts:" | cut -d " " -f3 >> abort$1$2.dat
done

$1 and $2 are just strings that identify the different files I want to consider in this loop. This script works well for extracting one number, which is the 3rd field of a line starting with "Total aborts:". Now, how could I change this script to do what I described above (i.e. extract two numbers from two different lines)? The second number is the 3rd field of a line starting with "Total throughput:".
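The idea of collecting both values in one pass over each file can be sketched as follows. This is a Python illustration, not a change to the bash script above; `extract_fields` is a hypothetical name, and the sketch assumes fields are separated by single spaces as the `cut -d " " -f3` call implies.

```python
def extract_fields(lines, labels=("Total aborts:", "Total throughput:")):
    """Return the 3rd field of the first line starting with each label, in label order."""
    found = {}
    for line in lines:
        for label in labels:
            if label not in found and line.startswith(label):
                fields = line.split()
                if len(fields) >= 3:
                    found[label] = fields[2]
    # A missing label yields an empty string so the columns stay aligned.
    return [found.get(label, "") for label in labels]
```

Writing `" ".join(extract_fields(open(filename)))` once per file would then give one output line per input file, with the two numbers separated by a space.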

View 7 Replies View Related

Programming :: Extract Part Of File Name?

Mar 26, 2010

I have this string: ./DAT000728-652523058.job. I want to extract the number between DAT and the - sign, without the leading zeros: I want 728, not 000728. With

echo ./DAT000728-652523058.job | cut -d'T' -f2 | cut -d'-' -f1

I am getting 000728. The string can also be ./DAT326822-652523058.job, in which case I need 326822.
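One way to drop the leading zeros is to let the pattern consume them before the capture group starts. A small Python sketch of that idea follows; `job_number` is a hypothetical name.

```python
import re


def job_number(path):
    """Return the digits between DAT and '-' with leading zeros removed."""
    # "0*" swallows the leading zeros; "(\d+)" must still capture at least one digit.
    match = re.search(r"DAT0*(\d+)-", path)
    return match.group(1) if match else None
```

Because the capture group needs at least one digit, a name made up entirely of zeros still yields "0" rather than an empty string.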

View 6 Replies View Related

Programming :: Extract Portion Of Text - IRC Log

Jul 18, 2011

I have a lot of files containing chat-log (IRC) and would like to extract information out of these files.

File sample

Code:
Session Start: Sat Apr 03 15:06:29 2010
Session Ident: XXX
[15:06] XXX is ~X@host-85-85-85-154.isp.be * XXX
[15:06] XXX on #channel1 #channel2 #channel3

[Code]....

View 2 Replies View Related

Programming :: Extract The Text From Files?

Aug 28, 2010

I have many files in a folder from which I need to extract some contents. These are basically text files which have individual lines such as:

name: john
address: whatever
phone: 123456

Some caveats

1. Sometimes a line might be missing.

name: john
phone: 123456

2. Lines are not at the same line numbers across the files.

I did try some things with awk based on Google searches, but I couldn't extract the data of each file into a single line (this is the ultimate goal):

john,whatever,123456

I don't have any knowledge beyond having put some bash scripts together for backup jobs, so I am open to installing anything that could pull this off.
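Since the lines can be missing or appear in any order, looking fields up by key rather than by position handles both caveats. Here is a minimal Python sketch of that approach; `FIELDS`, `parse_record` and `to_csv` are hypothetical names, and the sketch assumes every wanted line has the form "key: value".

```python
import csv
import io

FIELDS = ("name", "address", "phone")


def parse_record(lines):
    """Return [name, address, phone] for one file; missing keys become empty strings."""
    record = {}
    for line in lines:
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in FIELDS:
            record[key] = value.strip()
    return [record.get(field, "") for field in FIELDS]


def to_csv(records):
    """Render a list of records as CSV text, one record per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(records)
    return buffer.getvalue()
```

Using the csv module rather than a plain join means a comma inside an address is quoted instead of breaking the columns.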

View 14 Replies View Related

Programming :: Extract URL From Firefox Address Bar ?

Nov 27, 2010

I'm trying to create an application that monitors, among other things, what site the user is currently viewing. I would like to know if there is any way to get the current URL from Firefox's address bar on a Linux machine. I know that under Windows I can use the DDE server approach, but under Linux this task is proving very tricky. I've considered an approach involving a Firefox extension, but this would require the user to install the extension himself, which is not something I want. If an extension can be installed by a different program's installer, then that could work, but I don't know whether that's possible.

View 14 Replies View Related

Programming :: Extract Values From Array PHP

Jul 2, 2009

The idea is to make a website that checks the availability of domains. It works, but it's not pretty yet. Below is what I have so far:

## this is the API from my domain registrar.
<?php $client = new SoapClient('http://api.sync.com/?wsdl');
## I have a search box that sends the request to this page
$var = $_GET ["s"];

## remove the most common subdomains from the request.
## (eregi_replace is deprecated; the escaped dot matches only a literal ".")
$var = preg_replace('/^(www|mail|ftp|pop|smtp)\./i', '', $var);

## remove any TLD extension from the request.
$split = explode(".", $var);
$main = $split[0];
$arraysize = sizeof($split);
for ($x=1; $x<$arraysize; $x++) {
$tld .= "." . $split[$x];
}
## login to the API
$paramLogin = array('handle' => 'randall', 'password' => 'password');

## match the domain with any possible TLD
$varcom = $paramAvailDomain = array('sld' => $main, 'tld' => 'com');
$varnet = $paramAvailDomain = array('sld' => $main, 'tld' => 'net');
$varorg = $paramAvailDomain = array('sld' => $main, 'tld' => 'org');
$varbiz = $paramAvailDomain = array('sld' => $main, 'tld' => 'biz');
$varinfo = $paramAvailDomain = array('sld' => $main, 'tld' => 'info');
$vareu = $paramAvailDomain = array('sld' => $main, 'tld' => 'eu');
$varnl = $paramAvailDomain = array('sld' => $main, 'tld' => 'nl');
$varbe = $paramAvailDomain = array('sld' => $main, 'tld' => 'be');
$varde = $paramAvailDomain = array('sld' => $main, 'tld' => 'de');
$varcouk = $paramAvailDomain = array('sld' => $main, 'tld' => 'co.uk');
$varorguk = $paramAvailDomain = array('sld' => $main, 'tld' => 'org.uk');
$varname = $paramAvailDomain = array('sld' => $main, 'tld' => 'name');
$varmobi = $paramAvailDomain = array('sld' => $main, 'tld' => 'mobi');
$varin = $paramAvailDomain = array('sld' => $main, 'tld' => 'in');
$vartv = $paramAvailDomain = array('sld' => $main, 'tld' => 'tv');
$varcn = $paramAvailDomain = array('sld' => $main, 'tld' => 'cn');
$varws = $paramAvailDomain = array('sld' => $main, 'tld' => 'ws');
$varnu = $paramAvailDomain = array('sld' => $main, 'tld' => 'nu');
$varbz = $paramAvailDomain = array('sld' => $main, 'tld' => 'bz');
$varcc = $paramAvailDomain = array('sld' => $main, 'tld' => 'cc');

## this requests the domain.COM and domain.NET
$varcom;
$varnet;
?>
<div id="content">

## below prints the result
<?php
print "<html><body><pre>";
$result1 = $client->__soapCall('Login', $paramLogin);
echo "<b>Result Login:</b>
" . print_r($result1, true);

$result15 = $client->__soapCall('AvailabilityDomain', $varcom);
$resvarcom = var_dump($result15, true);
$result15 = $client->__soapCall('AvailabilityDomain', $varnet);
$resvarnet = var_dump($result15, true);

print "</pre></html>";
?>
<?php

## the returned array looks like this

Result Login:
Array
(
[code] => 200
[message] => Login succesful
)
array(3) {
["code"]=>
string(3) "200"
["message"]=>
string(20) "Domain not available"
["result"]=>
object(stdClass)#236 (1) {
["status"]=>
string(5) "TAKEN"
}
}
bool(true)
array(3) {
["code"]=>
string(3) "200"
["message"]=>
string(16) "Domain available"
["result"]=>
object(stdClass)#232 (1) {
["status"]=>
string(4) "FREE"
}
}
bool(true)
?>
## till so far it works

What I need to do is turn this ugly-looking reply into something more readable: basically, if the status is TAKEN, print "occupied", and if it is FREE, print "it's yours to grab". I have been struggling with the in_array function, but I'm not getting anywhere close to making it work.

View 2 Replies View Related

Programming :: Extract Metadata From Image

Mar 13, 2009

I am trying to get the metadata out of an image file in Python. I have tried using PIL, but it does not give me the data I am looking for (I mostly just got a bunch of hex code), and I have no idea how to use ImageMagick: the Python module is poorly documented and I can't find any examples on the net. The info I need is stuff like camera model, whether flash was used, focal length, exposure time, date, etc., pretty much the same info I get when I look at the "Image" tab in the properties dialog in Nautilus on Ubuntu.

What I am doing is writing a script that will iterate through a lot of pictures and put all this metadata into MySQL. I chose python since it is simple and I am familiar with it. But I can't find a good way to get that metadata from within python.

View 2 Replies View Related

Programming :: Extract Source Email Address - Awk?

Dec 17, 2010

I have a small bash/awk program that extracts the date/time/size of thousands of email headers. I'm also trying to extract the last "Received from:" string from these email headers, which will give me the sender's email server. How can I extract the last occurrence of this string and print the information after it?
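As an alternative to awk, Python's standard email package can collect every Received header, after which the last one can be picked out directly. The sketch below is only an illustration under the assumption that the headers follow the usual "Received: from host ..." form; `origin_server` is a hypothetical name.

```python
import re
from email import message_from_string


def origin_server(raw_headers):
    """Return the host named after 'from' in the last Received header, or None."""
    message = message_from_string(raw_headers)
    received = message.get_all("Received") or []
    if not received:
        return None
    # Headers are kept in file order, so the last entry is the earliest hop.
    match = re.search(r"\bfrom\s+(\S+)", received[-1])
    return match.group(1) if match else None
```

Parsing the headers this way also takes care of folded lines, where one Received header continues on an indented second line.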

View 3 Replies View Related

Programming :: Awk To Extract Phrase Between Two Words On A Line?

May 25, 2010

I'm trying to find a way to extract the phrase between the words Connection and is (i.e. the underlined words below). Can we use awk to do this? How? Is it the best command to use?

Code:

[06:25:00][i] Connection at Plant A is live
[06:25:00][i] Connection at Building_C is not live
[07:25:00][i] Connection at Terminal D is down
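To show the matching logic, here is a short Python sketch that captures whatever lies between the word Connection and the first following word is. The underlining from the original post is lost, so this takes the question literally and keeps the word "at" in the result. `BETWEEN_RE` and `phrase_between` are hypothetical names.

```python
import re

# Non-greedy capture so the match stops at the first standalone "is".
BETWEEN_RE = re.compile(r"\bConnection\s+(.*?)\s+is\b")


def phrase_between(line):
    """Return the text between 'Connection' and 'is', or None if absent."""
    match = BETWEEN_RE.search(line)
    return match.group(1) if match else None
```

If only the location name is wanted, changing the pattern to start after "Connection at" would drop the leading "at".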

View 12 Replies View Related

Programming :: Extract A Substring Using Regular Expression With SED

May 7, 2011

I've spent most of the evening browsing the web, trying many things I've found on various forums, but nothing seems to work.

I have a test.txt file containing many lines like the following ones :

...
<insert_random_text>228.00 €<insert_more_random_text>
<insert_random_text>17.50 €<insert_more_random_text>
<insert_random_text>1238.13 €<insert_more_random_text>
...

And I want to extract :

...
228.00
17.50
1238.13
...

There is always one occurrence of € in each line. I want the numeric value that precedes this € occurrence. The random text (before and after) may contain numbers too, so the € may be important to parse, in order to correctly identify the number to return. The last character that precedes the number to extract is always a ">" (coming from an HTML tag).
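The constraints described above (the number is preceded by ">" and followed by €) can be written directly into a pattern. The following Python sketch illustrates this instead of sed; `EURO_RE` and `euro_amount` are hypothetical names, and optional whitespace is allowed on either side of the number.

```python
import re

# ">" anchors the start, "€" anchors the end, so other numbers on the line are ignored.
EURO_RE = re.compile(r">\s*([0-9]+(?:\.[0-9]+)?)\s*€")


def euro_amount(line):
    """Return the number that directly precedes the euro sign, or None."""
    match = EURO_RE.search(line)
    return match.group(1) if match else None
```

Anchoring on both sides is what keeps numbers elsewhere in the surrounding text from being picked up by mistake.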

View 9 Replies View Related

Programming :: Extract The Full Name Of The Process Running In Box?

Jul 11, 2011

I have a requirement where I want to extract the full name of a process running on my box. I tried various ps options. The wide option gave me the full command line, but that contains the interpreter, the command, and the arguments passed to it.

Code:

XX XX XX XX XX XX /usr/bin/sh /path/to/exe/myexe.sh arg1 arg2 arg3.

Is there any way, with ps or any other command, to extract just the full name of the command?

Desired Output :

Code:

/path/to/exe/myexe.sh or myexe.sh

View 12 Replies View Related

Programming :: How To Extract Output Ulimit Under Popen

Jan 30, 2011

A strange question, I guess. I'm running processes called from a C main program. The call is performed (for now) as: FILE *res = popen("ulimit -t 1; prg args", "r"); so I can read the stdout of the process as a file and analyze it. The time limit is important for me.

2 questions:
1. How do I get to know if the process terminated on its own or by the ulimit?
2. How do I limit to times that are less than 1 sec (I have many of those).

I know that setrlimit exists, just before I change my whole approach I wanted to see if I can deal with these things from the outside.

View 2 Replies View Related

Programming :: Perl Extract Information For A Particular Column?

Oct 24, 2010

I have a file which has the output as shown below:

Code:

Teams | matches |Goals | YC | RC
------------------------------------------------------------------------------
Liverpool: | | | |
Gerrard | 97 | 100 | 41665 | 1342

[code]....

I need to extract the info from the RC column for the first 4 Liverpool players. The test code I have does that, but can anyone show me a better way of doing it? I could do it easily with gawk -F"|" and print the respective column, but I need to do this in Perl.

Code:

#!/usr/bin/perl
use strict;
use warnings;

[code]....

View 7 Replies View Related

Programming :: Use Awk And Sed To Extract Height And Width Of File

Aug 3, 2010

I am trying to write commands that extracts the height and width of a video file via ffmpeg. I have the following working so far:

This gives the following answer in widthxheight format with an extra , 720x480,

How can I instead run two separate commands that give me the height and width separately? I want one command to give me 720 and another to give me 480, and I don't need the x or the ,

If you need to know this is what ffmpeg -i videofile.mov 2>&1 gives as output

Seems stream 0 codec frame rate differs from container frame rate:

At least one output file must be specified
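Once a string in the widthxheight form is available, splitting it into two numbers is straightforward. The Python sketch below only illustrates that last step; `video_size` is a hypothetical name, and the sketch assumes the first digits-x-digits token in the text is the frame size.

```python
import re


def video_size(text):
    """Return (width, height) from the first WIDTHxHEIGHT token, or None."""
    match = re.search(r"\b(\d{2,5})x(\d{2,5})\b", text)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))
```

Requiring at least two digits on each side keeps hexadecimal-looking tokens that begin with 0x from being mistaken for a size.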

View 6 Replies View Related

Programming :: Extract Via Command Line The Latitude And Longitude?

Aug 10, 2010

I am trying to extract the latitude and longitude via the command line with this command: Code: curl -s [URL]

View 4 Replies View Related

Programming :: Extract Specific Lines From A Flat File?

Mar 10, 2011

I'm trying to extract specific lines from a flat file. I need the lines that fall within a range of coordinates: latitude 36 to 39 and longitude -74 to -84. The field separator (-F) can be either ! or =. If a line is within this range, I need all of the data on that line. The command below assumes that the position follows the first ! or = and that all longitudes are west:

awk -F '[!=]' '{lat=substr($2,1,2)+0; lon=-(substr($2,10,3)+0); if (lat>=36 && lat<39 && lon<=-74 && lon>-84) print}' net.log

example line from the flat file
K4MQF-3>APN383,VA2-2,qAR,N3HF-5:!3818.65NS07800.17W#PHG77306/W3,VA3/Clarke Mnt

View 9 Replies View Related

Programming :: Extract Dwarf Information Debug A Section?

May 10, 2011

I would like to extract debug information, but I have some problems. For example, I have an executable a.out...

Quote:

nm -f sysv a.out | grep ".global_var" >vars.txt

With this command I extract all my variables. All of them are in the .global_var section, and it gives me the following information:

Quote:

CAN_station_n |08073258| D | OBJECT|00000001| |.global_var
CONTROLend |080732a7| D | OBJECT|00000001| |.global_var

[code]....

So I have only the addresses of my variables, but I would also like to know the type of each variable and whether it is a struct. With a DWARF dump I get all of this information, but it is a mess...

Quote:

<1><117bc>: Abbrev Number: 32 (DW_TAG_variable)
<117bd> DW_AT_name : (indirect string, offset: 0x153d): draw_limits
<117c1> DW_AT_decl_file : 128
<117c2> DW_AT_decl_line : 207

[code]...

Is there any parser, or any other way, to put this information in order and create a file with the following fields?

name of var - address - type - size - struct or not

View 5 Replies View Related

Programming :: Unable To Extract Java Regex For Links

Mar 22, 2011

I'm trying to extract the href of a <link> tag from an HTML page. However, as some links contain further preferences, I seem to be unable to extract them. Do you have any idea how I can write this? Link:

[Code]...

View 9 Replies View Related

Programming :: Use Perl Regex To Extract The Hostname From A FQDN?

Nov 19, 2010

How do I use a Perl regex to extract the hostname from an FQDN?

I have

Quote:

$host=ganymede.a.linux.com
$host=io.a.linux.com
$host=europa.a.linux.com

I just want the characters to the left of the first . (dot) in the FQDN. I could get them using the substr and split functions, but how do I get them with a regex?
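The pattern itself is small: capture one or more characters that are not a dot, starting at the beginning of the string. It is shown here as a Python sketch (`short_hostname` is a hypothetical name); the same expression, `^([^.]+)`, is also valid Perl regex syntax.

```python
import re


def short_hostname(fqdn):
    """Return everything before the first dot, or None for an empty label."""
    # re.match anchors at the start, so no explicit "^" is needed here.
    match = re.match(r"([^.]+)", fqdn)
    return match.group(1) if match else None
```

A negated character class is used instead of `.*?\.` so that a name without any dot is still returned whole.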

View 13 Replies View Related

Programming :: Extract Extra Packages In A Separate File?

Mar 6, 2009

I have two files, each containing a list of packages, generated with the command:

Code:

dpkg --get-selections > file-name
package-a.txt

[code]......

Now I would like to create a third file which contains only those packages which are present in package-a.txt but NOT in package-b.txt. The file should look like this:

Code:

package2
package4

Note: The word "install" is also to be removed for all packages. Using the diff command I could get something like this:

Code:

temp# diff -yb package-a.txt package-b.txt | grep "<"
package2 install <
package4 install <
<
temp#

But I am not sure how to remove the instances of "install" and "<".
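Comparing package names as a set avoids having to clean up diff output at all. The following Python sketch illustrates the idea; `package_names` and `only_in_first` are hypothetical names, and each line is assumed to be a package name followed by the word install, as in the output shown above.

```python
def package_names(lines):
    """Return the package name from every 'name install' line, in file order."""
    names = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "install":
            names.append(fields[0])
    return names


def only_in_first(a_lines, b_lines):
    """Return packages listed in a_lines but not in b_lines, keeping a's order."""
    b_names = set(package_names(b_lines))
    return [name for name in package_names(a_lines) if name not in b_names]
```

Writing the returned names one per line gives a file without the "install" column or the "<" markers.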

View 3 Replies View Related

Programming :: Extract Data By Sending Queries To A Website?

Dec 17, 2010

What would be the best way to extract data by sending queries to a website?

View 2 Replies View Related







Copyrights 2005-15 www.BigResource.com, All rights reserved