General :: Most Efficient Way Of Taking Subset Of Lines
Feb 9, 2010
I have two files: a huge one (200,000+ lines) called 'db' and a big one (15,000+ lines) called 'indices'. What is the quickest way of filtering out the lines in 'db' containing any index (anywhere on the line) from 'indices'? Is there a faster approach in bash on Linux?
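A minimal sketch of one approach, in Python rather than bash. It assumes each index appears in 'db' as a whole whitespace-separated token, which the post does not state; true "anywhere on the line" substring matching would need something like `grep -F -f indices db` instead. The function names are hypothetical.

```python
def load_indices(lines):
    """Collect the non-empty index strings into a set for fast lookups."""
    return {line.strip() for line in lines if line.strip()}


def lines_with_index(db_lines, indices):
    """Yield each db line that has at least one token present in indices.

    Assumption: an index matches only as a whole whitespace-separated token.
    """
    for line in db_lines:
        if not indices.isdisjoint(line.split()):
            yield line
```

Passing open file objects for 'indices' and 'db' streams the big file once, so only the 15,000 indices are held in memory. To drop the matching lines instead of keeping them, invert the `isdisjoint` test.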
This is my first time on this forum. I am a statistician. I am trying to subset a large dataset by specifying the starting and ending lines. The dataset is pretty large (more than 300 million lines), containing around 1.2 million lines per person, so I would like to split it consecutively into one file per person. I tried wrapping R code, but R seems to have to read from the top of the file down to where I want, even though I specified that it should skip the lines that other tasks have already read. So the memory use increases with the task ID. Finally I got kicked out by the administrator.
I guess that the shell may do it much more simply and elegantly. First I thought of the "split" command, but the file has a header of 10 lines, so I can't split it into even-sized chunks.
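A rough sketch of a single-pass split, written in Python. It assumes every person occupies exactly the same number of consecutive lines (the post only says "around 1.2 million") and that the 10-line header should be copied into every output file. The function name and the `out_pattern` parameter are hypothetical.

```python
from itertools import islice


def split_by_person(src_path, out_pattern, header_lines=10,
                    lines_per_person=1_200_000):
    """Stream src_path once, writing one file per block of lines_per_person lines.

    out_pattern is a format string such as "person_{}.txt".
    Returns the number of files written.
    """
    count = 0
    with open(src_path) as src:
        # Read the header once so it can be repeated in every output file.
        header = list(islice(src, header_lines))
        while True:
            first = next(src, None)  # None means the input is exhausted
            if first is None:
                break
            count += 1
            with open(out_pattern.format(count), "w") as out:
                out.writelines(header)
                out.write(first)
                # Copy the rest of this block without loading it into memory.
                out.writelines(islice(src, lines_per_person - 1))
    return count
```

Because the file is read only once from top to bottom, memory use stays flat regardless of which person is being written.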
I have a huge file, 450 GB in size. Its format is as below:
x1 50020 A 1
x1 50021 B 8
x1 50022 C 9
[code]....
Now I want to extract a subset from this file. In this subset, column 1 is x10 and column 2 is between 600000 and 30000000. I wrote the following Perl script, but it doesn't work:
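The Perl script is not shown in the excerpt, so the following is only a sketch of the filter in Python. It assumes whitespace-separated columns and inclusive bounds, neither of which the post confirms. The function names are hypothetical.

```python
def in_subset(line, name="x10", low=600_000, high=30_000_000):
    """Return True if column 1 equals name and column 2 lies in [low, high]."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != name:
        return False
    try:
        position = int(fields[1])
    except ValueError:
        return False  # column 2 is not an integer, so skip the line
    return low <= position <= high


def extract_subset(lines, **bounds):
    """Lazily yield the matching lines, so a 450 GB file is never held in memory."""
    return (line for line in lines if in_subset(line, **bounds))
```

Comparing column 2 as an integer matters here: a plain string comparison would rank "9" above "600000".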
I've come across an unusual requirement for a service on my Ubuntu system. Simply put, I need to find a way to search for all instances of a term in a file, delete the lines containing that term, and delete the four lines below each instance of that term. Either that, or copy the entirety of a file to a new file and skip over all lines containing the term plus the four below it. This sounds kinda weird, I know. Without going too far into detail, I either have to change the logfile format for a server I'm running, which is a huge pain in the butt, or I can just run a script to edit an HTML report generated from said logs. (Said report is really just for managers to peruse, and I like my log format, so I'm pursuing option 2.)
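The second option (copy everything except the matching lines and the four after each) can be sketched in Python as below. It assumes the term is a plain substring, and that a new match inside a skipped block restarts the count of four. The function name is hypothetical.

```python
def drop_term_blocks(lines, term, extra=4):
    """Yield lines, skipping any line containing term and the extra lines after it."""
    skip = 0
    for line in lines:
        if term in line:
            skip = extra  # restart the countdown on every match
            continue
        if skip > 0:
            skip -= 1
            continue
        yield line
```

Writing the yielded lines to a new file leaves the original report untouched, which makes it easy to compare the two.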
I am starting a project of my own (and learning C++ at the same time). I got my program to successfully scan a custom netmask, but it is REALLY slow. I want my program to do something similar to nmap -sP xxx.xxx.xxx.xxx-xxx. How can I speed it up, for example by pinging more than one IP at a time?
I wrote this script to attach URLs to specified 6-digit numbers in a configuration text file. My original goal was to be able to pull the URLs and the 6-digit numbers from .csv files. That would allow me to make the script more versatile, not only for this particular project but also for other projects that use the configuration file. This script works and has served its purpose, but it is not very pretty, and it's probably not very efficient. What can I do to improve it and possibly make it more versatile? I've thought about functions and arrays, but my skill set is still pretty limited. I'm not looking for someone to write it for me, just to point me in the right direction.
I am trying to set up the most efficient way to remotely administer an Ubuntu system. So far I have been successful in setting up ssh. I have the server running, made sure it only uses keys to authenticate, and changed the default port. I can connect to the ssh server. This system is behind NAT, and bots were trying to break in when I forwarded the port. I used the AllowUsers option, which helped in knocking back these rogue connections. I was not comfortable leaving a port open all the time, so I set up LogMeIn Hamachi VPN and closed the ports that were open on the NAT.
Now I use Hamachi and am very impressed. I use VNC over the VPN. But Vino is slow; I tested x11vnc and it was slightly faster. I also tried to set up FreeNX, but the nxsetup file was missing. Someone posted that file in the forums and suggested copying it to the system, but I was not sure I wanted to download a script from a forum and use it. So for the time being, I thought I would just concentrate on x11vnc until FreeNX is fixed.
I would like to create a separate desktop on the server and use it for remote access. This way we will not have to fight over the cursor (sounds funny, but it is very frustrating). From what I understand, x11vnc has this function built in (it uses vncserver), but I cannot get it to work! When using ssh, if I type "evince xyz.pdf" at the remote machine's command prompt, it does not launch "xyz.pdf". What am I doing wrong? When using VNC, how can I switch between different users on the remote machine? Right now I can only use VNC in the one account that I first set up. When I try to change to another user, the VNC client goes blank. How can I get as much control of the remote system as possible? Right now I need someone to switch on the remote system and log in to their account before I can use VNC. Can I use Ekiga inside the VPN? I use Skype right now and it's sometimes terrible, with dropouts. I looked around and someone suggested using Ekiga, as it works very well over a LAN. How do I set up Ekiga so that I have a direct connection with the remote system? Is there any client that just takes the IP address and creates a connection? Mumble seems to be the popular choice, but it's most suited to LAN parties; I just need a connection between two systems. I am trying to set this up to help users with hardly any computer knowledge use the system without any issues.
I need to figure out how to arrange for the fastest-possible read-access of a large or huge memory-mapped file. I'm writing high-speed real-time object-chasing software for a NASA telescope (on earth). This software must detect images of fast moving objects (across arbitrary fields of fixed stars), estimate what direction and speed the object image is traveling (based on the length and direction of a streak on the detection image), then chase after the object while capturing new 4Kx4K pixel images every 2~5 seconds, quickly matching its speed and trajectory, then continue to track and capture images until the object vanishes (below horizon, into earth shadow, etc).
I have created two star "catalogs". Both contain the same 1+ billion stars (and other objects), but one is a "master catalog" that contains all known information about each object (128 bytes per object == 143GB) while the other is a "nightly build" that only contains the information necessary to perform the real-time process (32 bytes per object == 36GB) with object positions precisely updated for precession and proper-motion each night. Almost always the information in the "nightly build" catalog will be sufficient for the high-speed (real-time) processes.
I have been experiencing a problem where the screen loads and, after the first few lines, breaks up into multiple repetitions of lines. Reloading helps but has to be repeated when paging down. Mail is no problem; it is supplied by my network provider. The OS is openSUSE 11.2, which I update when advised. Below is a sample from the error console:
I've just installed Kubuntu 11.04, switched on wobbly windows effect. It runs very smooth on my Nvidia GeForce 7600 GS with dual screen twinview turned on. However, I get these lines when I drag/move the window upwards - see screenshot:
I have this massive table file with some data in it and I want to replace some lines that are wrong with the correct ones that are in another table file of the same format. The wrong lines are not all together in a block but randomly distributed so I need to make a loop checking if the line is in the other file and if it is, replace it. I want to try and do it with sed or awk but I don't really know how to....
I'm working on a tutorial using Backtrack 4 Live USB, and I would like to take a screencast of what I'm doing (not just screenshots). So far I have tried these applications with limited success:
- recordmydesktop
- xvidcap
- wink
- istanbul
- vlc
- vnc2flv
Each time I try, the resulting files are generally choppy (at best 1 frame per second), and most don't even end up with a clear view of the screen.
I'm trying to dual boot a server with Windows 2008 R2 and RHEL 5.4. I've done these dual boots 20 or more times in the last few months and never had a problem. (This is the first time I've tried it with R2, mind you.) I do the normal thing: install Windows on one drive, then install RHEL on another. The first time it reboots after the install (which is still part of the install, as it's my kickstart script driving the install and it still has post to run), it goes to the grub menu, and I can select RHEL and it boots into it fine. After that, if I reboot, it just goes directly into Windows without showing grub at all.
I've tried pressing Esc, Shift, and various other keys in case it's hidden, with no luck (also, I have a 30 second timeout set and it's not sitting there that long). I've tried editing the grub.conf to remove Windows entirely, and it still just goes into Windows. I've reinstalled RHEL 3 times (yay for kickstart files!), and this exact behaviour happens every time. Does anyone have any idea of what might be going on here?
The NIC is connected to a LINKSYS WRT54G running DHCP. There are plenty of available IP assignments. All the other PCs that I have connected to the LINKSYS work fine. The CAT 5 cable is fine. Why is this NIC not taking a DHCP assignment?
I have got a refurbished Acer Aspire One with Linpus Linux. Now:
1. The Mozilla Firefox browser seems pretty old, but if I try to download a new version it comes as a zip file.
2. The same goes for Microsoft docx files: they come as zip files. I cannot sort out updating the browser, and cannot download document files.
3. Also, the desktop is very handy but unattractive. How can I customize the desktop myself?
4. The built-in messenger does not accept any ID except one from Gmail.
I have Ubuntu 10.10 installed. When it was new, it started up very quickly, but now it takes about two to three minutes to start. I tried removing some applications from System/Preferences/Startup Applications, but it made no difference. I want to speed up my system. It has no speed problems after starting up.
For some purpose in my home network, I want a specific IP address to resolve to localhost on that machine. For example, if I type 123.123.123.123, I want that to go to localhost.
I have 2 Oracle users that generate .tmp files under /var/tmp. By default, the files have the permissions 644. Now, a need has arisen whereby the files created by these users have to have the permission bits as 664. Obviously, I changed the UMASK value for these users from 022 to 02. But the files are still getting created with 644 as the permission.
I tried restarting the application as I read that a relogin is required for the UMASK change to take effect. Even that hasn't helped.
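One possible explanation, not confirmed by the post, is that the umask can only remove permission bits from the mode that the creating program requests. If the application asks for 0644 explicitly, or was started by a process that still carries the old umask, the files stay at 644. The Python sketch below demonstrates the first effect on a POSIX system; the function name is hypothetical.

```python
import os
import stat
import tempfile


def created_mode(requested_mode, umask):
    """Create a scratch file with requested_mode under umask; return its permission bits."""
    old = os.umask(umask)
    try:
        with tempfile.TemporaryDirectory() as scratch:
            path = os.path.join(scratch, "probe.tmp")
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, requested_mode)
            os.close(fd)
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old)  # always restore the caller's umask
```

With a umask of 002, a request for 0666 yields 664, but a request for 0644 still yields 644, because the group write bit was never requested in the first place.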
I have 2 hard drives, and when I do an auto install, Ubuntu decides to take both hard drives. How do I prevent this from happening without my intervention?
drw-rw-rw- 2 owner developers 4096 Jun 24 15:13 models
these were set with
sudo chmod -R 0666 *
My user has developers as the primary group (the same group as the file), but I cannot access the directory via the terminal or ftp.
[myUser@machine]$ id myUser
uid=503(myUser) gid=505(developers) groups=505(developers)
$ cd models
-bash: cd: models: Permission denied
I had the same problem before the directory belonged to my group, and I even went so far as to restart the server, without any luck. How do I set permissions to this directory so that I and other members of the group developers can access it?
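A likely cause, given the listing above, is that `chmod -R 0666` cleared the execute (search) bit on the directory; without that bit, `cd` into a directory is denied even for users who have read and write permission. The Python sketch below restores separate modes for directories and files. The modes 775 and 664 are assumptions chosen to give the group full access, and the function name is hypothetical.

```python
import os


def fix_tree_modes(root, dir_mode=0o775, file_mode=0o664):
    """Give directories dir_mode (with the search bit) and regular files file_mode."""
    os.chmod(root, dir_mode)
    for dirpath, dirnames, filenames in os.walk(root):
        # Subdirectories are fixed before os.walk descends into them,
        # so directories that lacked the search bit become traversable in time.
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                os.chmod(path, dir_mode)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                os.chmod(path, file_mode)
```

The caller needs to own the files or run with sufficient privileges, just as with the original `sudo chmod`.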
The find command is taking too long on my machine to complete. When I use time command, I find that sys time and user time are too small as compared to real time. Is my find process not getting scheduled properly?
I interrupted the neverending find command and got the following statistics:
Real time: 5 min
Sys time: 1.1 sec
User time: 3 sec
I have around 100 users. I want to back up the files on the desktop of every user. My user directory path is: /home/dr/<user_name>/Desktop
1) The script has to run at a particular time every day.
2) The script has to back up all files present in the "Desktop" directory.
3) Make a tar named "yyyy-mm-dd-desk-files".
4) Make a directory outside "Desktop" named "Desktop-Backup"; if it already exists, don't create it.
5) The tar has to be moved into this folder.
6) Remove the files from the "Desktop" directory (i.e. Desktop should be empty).
7) Mail the status "Backup Successful".
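Steps 2 to 6 for a single user can be sketched in Python as follows. Scheduling (for example with cron) and the status mail are left out. The sketch assumes that "Desktop-Backup" sits next to "Desktop" in the user's home directory and that an uncompressed .tar is wanted; neither is stated in the post. The function name is hypothetical.

```python
import os
import shutil
import tarfile
from datetime import date


def backup_desktop(user_home, today=None):
    """Archive everything in user_home/Desktop, then empty it. Returns the tar path."""
    desktop = os.path.join(user_home, "Desktop")
    backup_dir = os.path.join(user_home, "Desktop-Backup")
    os.makedirs(backup_dir, exist_ok=True)  # no error if it already exists

    stamp = (today or date.today()).strftime("%Y-%m-%d")
    archive = os.path.join(backup_dir, stamp + "-desk-files.tar")

    entries = sorted(os.listdir(desktop))
    with tarfile.open(archive, "w") as tar:
        for name in entries:
            tar.add(os.path.join(desktop, name), arcname=name)

    # Delete only after the archive has been written and closed.
    for name in entries:
        path = os.path.join(desktop, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    return archive
```

Calling this for every directory under /home/dr covers all users. Note that a second run on the same day would overwrite that day's archive.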
I have about 200k data entries in an XML file. I wrote a PHP script (using php-xml) to read the XML file and insert the entries into MySQL. At first the inserts went really quickly, but after about 100k entries it slowed right down, as if it were not doing anything at all. I have CentOS with 512M running as a server on VirtualBox.
I have some confusion about one of my partitions and the space it is taking. The df -h output is given below:
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ddf1_ADVDTARTINGp1 494G 18G 452G 4% /
[code]....
The above information shows that the total size of the /var/lib/mysql partition is 379 GB and that it is 68% used. However, when I execute the command du -sh /var/lib/mysql, it shows the following output:
# du -sh /var/lib/mysql
45G /var/lib/mysql
Now I want to know which files are taking up the space that makes the partition 68% used. I want to list all the files in that partition with their sizes.
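A minimal Python sketch for listing the files under one mount point, largest first. It reports apparent file sizes (`st_size`), which can differ from the blocks counted by df, and it cannot see files that have been deleted but are still held open by a process. The function name is hypothetical.

```python
import os


def files_by_size(root):
    """Return (size_in_bytes, path) pairs for files under root, largest first.

    Directories on other filesystems are skipped by comparing device numbers.
    """
    root_dev = os.lstat(root).st_dev
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune subdirectories that are mount points of other filesystems.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # the file vanished or is unreadable
            sizes.append((info.st_size, path))
    sizes.sort(reverse=True)
    return sizes
```

Running it as root on /var/lib/mysql and printing the first few dozen entries is usually enough to see where the space went.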
I am working in Ubuntu 9.x (Karmic kernel). I was restoring content from a CD to the hard disk, and the process failed midway. I would like to know the following things:
1) At which position did it fail?
2) Is there any offset option in Linux to point to a particular position on the CD?