Programming :: Splitting Text File Into Each Duplicate
Dec 10, 2010
I have a text file that is filled with references to duplicate files. I'm trying to create a text file for each duplicate file found that contains the paths to the duplicates. I would also like the text file names to be based on the size and file name.
Was wondering if any Perl gurus could help me with a quick log file adjustment. I have a text file that looks like so (tabs and newlines are revealed so you can see what separates the data):
There are maybe 100 lines of text in this file at any given time. I need to delete all duplicate lines, looking only at the first bit of text prior to the first tab. It doesn't matter which one gets deleted, as long as no two lines begin with the same text before the first tab. So in this example, either the first line "1234" or the last line "1234" would need to be deleted. I already have code in my script that opens the files; I just need the code that reads the text into an array, finds matches based on the above criteria, and makes the deletions.
If it would be easier, I can even do a system call and use SED (v4.1.5) and/or AWK (3.1.5) instead.
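Since sed/awk is already on the table, an awk one-liner covers this: keep each line only the first time its first tab-separated field has been seen (this keeps the first "1234" line and drops the later one; run the file through tac first to keep the last instead; file names are placeholders):

Code:
awk -F'\t' '!seen[$1]++' logfile.txt > logfile.deduped.txt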
I have a utility that works with files. The utility is crashing after about 120 files. The input to the utility is a file containing a file list. I want to cut the file with the file names in it into separate files containing about one hundred lines or so. My thought was to determine the number of lines divided by 100 and then use head and delete to create temporary files, so I can run the utility multiple times and prevent the crash. When I tried to create a variable using the wc -l command, the output gave me the total number of lines, but it also included the filename of the input file (873 Filename.txt). I cannot figure out how to remove Filename.txt from the variable.
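Redirecting the file into wc keeps the filename out of the output, and split can do the chunking in one step (the file and utility names below are placeholders):

Code:
count=$(wc -l < Filename.txt)       # wc reads stdin, so it prints only the number
split -l 100 Filename.txt chunk_    # chunk_aa, chunk_ab, ... each <= 100 lines
for f in chunk_*
do
    ./my_utility "$f"               # hypothetical invocation, once per chunk
done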
I am splitting a file based on the values read from an input file. The script is below.
1) How do I add the header, which is present in the original file, to the newly split files? (E.g., pharmacyf contains a header with the table column names; the new files created (ODS.POS.$pharmacyid.$tablename.$CURRENT_DATE.dat) are without the header.)
2) The script is also creating 0-byte files for the pharmacyids which are not present in the initial file. Can this be avoided?
for pharmacyf in *
do
    tablename=`echo $pharmacyf | cut -f4 -d'.'`
    while read pharmacyid
    do
        grep -w $pharmacyid $pharmacyf >> $OUT/ODS.POS.$pharmacyid.$tablename.$CURRENT_DATE.dat
    done < inputfile
done
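A sketch of both fixes, assuming the header is the first line of each source file: seed each output file with the header only when grep actually finds matching rows, so empty files are never created:

Code:
for pharmacyf in *
do
    tablename=$(echo "$pharmacyf" | cut -f4 -d'.')
    while read pharmacyid
    do
        out="$OUT/ODS.POS.$pharmacyid.$tablename.$CURRENT_DATE.dat"
        # skip the header row, then test whether this id has any data rows
        if tail -n +2 "$pharmacyf" | grep -qw "$pharmacyid"
        then
            head -1 "$pharmacyf" > "$out"                              # copy the header
            tail -n +2 "$pharmacyf" | grep -w "$pharmacyid" >> "$out"  # append data rows
        fi
    done < inputfile
done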
Contained within each of these 67 text files is about 1 million URLs. Yes, I have 67 text files that contain 1 million lines of URLs each, and I am sure I am swimming in duplicates. I tried opening one text file and clicking sort -----> remove duplicates. Now gedit is not responding, my processor is maxed out at 100%, and I think I am finally ready to delve into some command line code. Can anyone give me idiot-proof instructions on how to sort the duplicates out of each one of these 67 text files? How about no duplicates across all 67?
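sort -u handles both jobs without loading anything into a GUI editor (gedit stalls because it tries to hold the whole file in memory; file names here are placeholders):

Code:
# dedupe each of the 67 files in place
for f in urls_*.txt
do
    sort -u "$f" -o "$f"    # -o lets sort write back to its own input file
done

# one combined list with no duplicates across all 67
sort -u urls_*.txt > all_unique.txt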
I am basically trying to remove duplicate words in my <title></title> tag after I got hit by Google Panda. I have around 750 .html files and it will be difficult for me to remove them one by one. I am looking for a way to remove duplicates only from within <title> </title>.
Example of a duplicate title I have:
Code:
<title>Pasta, Pasta Recipe and Pasta Guide</title>
I don't want to replace those words anywhere else in the file except within the <title>.
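A sketch in awk that rewrites only the <title> line of each file, keeping the first occurrence of each word (comparison is case-insensitive with punctuation stripped, but the kept words stay as written); it assumes the title sits on a single line:

Code:
for f in *.html
do
    awk '
    /<title>.*<\/title>/ {
        pre = $0; sub(/<title>.*/, "", pre)                 # anything before the tag
        t = $0; sub(/.*<title>/, "", t); sub(/<\/title>.*/, "", t)
        n = split(t, w, " "); out = ""; split("", seen)
        for (i = 1; i <= n; i++) {
            key = tolower(w[i]); gsub(/[,.;:]/, "", key)    # normalize for comparison only
            if (!seen[key]++) out = out (out == "" ? "" : " ") w[i]
        }
        print pre "<title>" out "</title>"
        next
    }
    { print }' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

On the example title this produces <title>Pasta, Recipe and Guide</title>; whether that reads well enough is worth spot-checking before running it over all 750 files.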
I have a big file of random numbers I generated at some point in time, and after working with it on different things (how fun that was)... I want to remove duplicate lines, and I'm not sure I'm doing this right.
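If the original line order should survive, awk drops later repeats without sorting; otherwise sort -u is simplest (file names are placeholders):

Code:
awk '!seen[$0]++' numbers.txt > numbers.deduped.txt
# or, if order doesn't matter:
sort -u numbers.txt > numbers.deduped.txt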
Trying to remove lines from a syslog text file that have duplicate strings
Mar 10 06:51:11[http-8080-1] INFO com.MYCOMPANY.webservices.userservice.web.UserServiceController [u:2533274802474744|360] Authorize [platformI$tformIdAndOs=2533274802474744|360, userRegion=America|360]
then a few lines down
Mar 10 06:52:03 [http-8080-1] INFO com.MYCOMPANY.webservices.userservice.web.UserServiceController [u:2533274802474744|360] Authorize [platformI$tformIdAndOs=2533274802474744|360, userRegion=America|360
It's got the same thing in terms of the u: number, but the issue is I need to remove duplicates and just leave one, and the file has multiple duplicates of different u: numbers and it's 14,000 lines long. Can anyone tell me if I can use awk, sed, or sort for something like this, i.e. removing lines that contain a string that's a duplicate?
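awk can do it in one pass: extract the [u:...] token from each line and keep only the first line for each distinct value (this assumes the token always appears in that bracketed form; file names are placeholders):

Code:
awk '{
    if (match($0, /\[u:[^]]*\]/)) {            # find the [u:...] token
        key = substr($0, RSTART, RLENGTH)
        if (seen[key]++) next                  # drop later lines with the same key
    }
    print
}' syslog.txt > syslog.deduped.txt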
I have a text file with many pairs of numbers, one pair on each line. Each 25 of these pairs is a solution to a math problem I've been working on, and each solution is separated from the next by a line with "**********". The problem is that there are duplicate solutions. In order to know exactly how many solutions I found, I have to delete the duplicate ones. How can I do that? Just to make things clear, here are the first three solutions:
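A sketch with gawk (a multi-character record separator is a gawk feature): treat each block between separator lines as one record and print only the first copy of each distinct block; counting the separators in the output then gives the number of unique solutions:

Code:
gawk 'BEGIN { RS = "\\*+\n"; ORS = "**********\n" } !seen[$0]++' solutions.txt > unique.txt
grep -c '^\*' unique.txt    # number of unique solutions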
I am facing a problem while splitting a text file. I need to split a file into some parts, and each split file should have 2000 lines. When I do it through the "split" command, the mother file is kept intact, but as per my requirement I need to cut the mother file into parts, so it should not be kept intact.
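split never modifies its input, so removing the mother file is simply a second step once the split succeeds (file names are placeholders):

Code:
split -l 2000 -d mother.txt part_ && rm mother.txt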
I need to insert 3-4 lines of text at the beginning of a text file. The file is a largish MySQL dump, the result of a backup shell script. This shell script should insert the required text. I've wrestled with sed, but lost.
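With GNU sed, -i edits the file in place and 1i inserts before the first line (the header text and file name here are placeholders):

Code:
sed -i '1i\
-- dump created by backup script\
-- do not edit by hand\
SET NAMES utf8;' dump.sql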
I have to delete a certain line of text from a text file via Ubuntu's shell scripting. I have done research, and it seems that most people advocate the use of sed's /d option. sed does not edit the text file in place by default; hence, most options I discovered involved the use of a temporary variable/text file and then overwriting the old file with the new one. Is there any way I can bypass the use of temporary storage containers? I hope there is some magical combination of commands to edit the file directly.
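GNU sed's -i flag does exactly this from the script's point of view: there is no temporary file for you to manage (internally sed still writes a scratch copy and renames it, which is unavoidable when rewriting a sequential file):

Code:
sed -i '/unwanted text/d' file.txt    # delete every line containing the string
sed -i '42d' file.txt                 # or delete by line number (42 is a placeholder)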
I want to display something in my text view widget in Glade using C code; that's all right. Now I need to attach a save button beneath the text view, so that on click the text view content is saved as a .txt file.
I want to display the contents of a particular log file (simple text file, I mean in Linux). But there is a problem: The contents need to be organized in a fixed format. Have a look at this log file:
So, while displaying the contents of above file on a web page, I want to format the field names found in the log file: User Name:, Reported Problems Description:, and Remarks:. These fields may contain a variable length of text and no specific line number is assumed for them to appear on.
Well, what I am trying to do may sound weird to some of you. The field "Reported Problems Description:" can possibly contain text which embeds a colon.
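One hedged approach with sed: bold the three known field labels for HTML output, anchoring each label at the start of a line so a colon inside the free text is never mistaken for a new field (label names are taken from the post; the file name is a placeholder):

Code:
sed -e 's|^\(User Name:\)|<strong>\1</strong>|' \
    -e 's|^\(Reported Problems Description:\)|<strong>\1</strong>|' \
    -e 's|^\(Remarks:\)|<strong>\1</strong>|' problem.log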
Is there a sed command to add text before a given line number in a text file? I have a text file with 500 lines, and I want to add 3 more lines of text after line 300, or before line 302; either is no problem.
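With GNU sed, the a command appends after the named line, in place (the inserted text is a placeholder):

Code:
sed -i '300a\
first new line\
second new line\
third new line' file.txt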
I have a web page where I can search for information in my database. Sometimes the results of a search operation number more than fifty. How can I split the results across different web pages, as happens, for example, in any search engine?
I have a .txt file with ~50,000 lines of numbers, generated by a mathematics program. From this file, I need lines ~1,100 to ~16,000 (these lines are always the same, by the way, which may make the solution easier, I don't know) copied to another file, where lines ~500 to ~15,000 (also the same every time) should be overwritten by the aforementioned lines. I haven't found or come up with anything that works yet; mostly I find solutions to copy everything from one file to another, but I can't find anything to specifically overwrite part of a file with part of another.
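A sketch that assembles the new target from three pieces: the target's lines before the overwritten range, the source's lines 1100-16000, and the target's tail (the exact line numbers are the approximate ones from the post; file names are placeholders):

Code:
{
    head -n 499 target.txt              # target lines 1-499, kept as-is
    sed -n '1100,16000p' source.txt     # source lines 1100-16000, pasted in
    tail -n +15001 target.txt           # target lines 15001 to end, kept as-is
} > target.new && mv target.new target.txt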
What I am trying to do is add the numbers from 1 to 100, but using multiprocessing. So I made a C program and, using the fork() command, made two child processes. In one child process I am adding from 1 to 50, and in the other I am adding 51 to 100; then in the parent process I add the two results to get the final one. I am getting the results from the two functions correctly, but after the wait() call the returned value is lost. See the program below for reference.
I'm trying to transfer a large .tgz file from a CentOS dedicated server to a Linux webhost (unknown OS). The problem is the webhost will not allow a 1.1 GB file to be uploaded; however, it will allow the upload in 149 MB chunks. I used the split command to segment my tgz into segments under 150 MB. I then uploaded all segments via FTP, which worked. Then I tried to join the segments to recreate the original tgz. The join appears to work with no issues. However, when I try to extract the tgz it appears there is a problem; most, but not all, files are extracted, and there is this error message:
Code:
gzip: stdin: Input/output error
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

It appears the join did not work and the tgz is slightly corrupt. What am I doing wrong? Here are the commands I'm using:

1. Create the original tgz on the dedicated server
Code:
tar -czf mysite.tgz ./myfolder

2. Split the tgz into segments
Code:
split -b 149m -d mysite.tgz seg
# using the -d switch so the segment files use a numerical suffix
# I now have these files: seg00 seg01 seg02 seg03 seg04 seg05 seg06 seg07

3. Transfer segments to the other webhost using FTP
Code:
# hand typing (not a script)
ftp ftp.mysite.com
myusername
mypassword
binary
cd somefolder
put seg00
put seg01
put seg02
# through to seg07

4. Join up the segments on the new webhost
Code:
# this is in a .sh script file
cd /full/path/to/somefolder
cat seg* > mysite.tgz

5. Extract the new tgz
Code:
# this is in a .sh script file
cd /full/path/to/somefolder
tar -xzf mysite.tgz
# the above error is now thrown
That's it. What am I doing wrong that's causing the above error?
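A quick way to narrow it down, assuming md5sum exists on both hosts: checksum the archive and every segment on each side; any segment whose checksum differs was corrupted in transit (an FTP session that silently fell back to ASCII mode is a classic cause, so it is worth re-checking that every upload really was binary):

Code:
# on the dedicated server
md5sum mysite.tgz seg*

# on the webhost, after joining
md5sum mysite.tgz seg*

# re-upload any segment whose checksum differs, in binary mode,
# then rejoin: cat seg* > mysite.tgz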
I have a text file that is just a list of servers, and I need to add the word hostname in front of each of them... It must be a brain fart, but I can't think of how to do this. Basically I need this:
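sed can prefix every line in one pass (the file name is a placeholder):

Code:
sed 's/^/hostname /' servers.txt > servers.new
# or in place, with GNU sed:
sed -i 's/^/hostname /' servers.txt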
I need to extract some text from a text file. The text is a test log with system info at the top and results further down. What I need is to add different tags with formatting before and after each line. I have prepared a template with HTML formatting, but the number of lines in the test log may differ from case to case, so I need to be able to add the formatting tags as needed. Can this be done using a bash script, sed, awk, head, tail...?
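Yes; sed can wrap every line regardless of how many there are. A sketch, with the tag names, file name, and the line where results begin all standing in for the real template:

Code:
sed -n '1,10p' test.log | sed 's|.*|<p>&</p>|'      # system info block
sed -n '11,$p' test.log | sed 's|.*|<li>&</li>|'    # results block, any length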
I have a file with 5000 lines. It is a list of book authors, series, and titles. All lines start with the author name, then there is a dash (-), then the series name, then a dash again, and then the title of the book.
The problem I encounter is that sometimes there is a series and sometimes not, and as I try to enter this list into a database, I wanted to create a CSV file to import into MySQL.
ex:
The best would be to be able to add, in the second line, a "space dash space" just after the author name, but how do I make sure it does not do that to the first line as well?
If I could find all lines with two dashes (grep?), then on the rest I would be able to do a simple replace and change the single dash into two.
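A sketch in awk, assuming the separators are written exactly as " - ": a line that splits into only two fields is missing its series, so an empty series field is inserted; converting the result to CSV is then one replace (naive, so titles containing commas would still need quoting):

Code:
awk -F' - ' 'NF == 2 { $0 = $1 " -  - " $2 } { print }' books.txt |
sed 's/ - /,/g' > books.csv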
I have a plain text file with 360 lines of varying-length text. How do I add a comma or other symbol to the end of each line so that I can convert the file to CSV format that I can open in a spreadsheet (45 rows, 8 columns)? That means each group of 8 lines of text forms one row of 8 columns, for 45 rows in all.
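paste can do the whole fold in one step: each "-" reads one line from standard input, so eight of them join every 8 lines into one comma-separated row, with no trailing comma to clean up (file names are placeholders):

Code:
paste -d, - - - - - - - - < input.txt > output.csv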
I'm trying to read a file in C++ and search for a particular character. For example, if this is the list I have:
Alice
Bob
David
[code]....
If the input is D, it should give David; if it's B, it gives Bob. So in this case it reads the first character of every line. But if possible I want to make this dynamic, so the user can specify which character position he is looking for; in case he is looking for R at character index 3 in all lines, it should give Charlie. The problem is it does not recognize that, and besides, I do not know how to specify the character position in each line.
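For comparison, the same filter is a one-liner in awk, if a shell tool is acceptable alongside the C++ version (substr is 1-based, so the poster's index 3 becomes position 4; the file name is a placeholder and the match is case-sensitive):

Code:
awk -v pos=4 -v ch=r 'substr($0, pos, 1) == ch' names.txt    # prints Charlie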