8 Data Manipulation
This guide covers essential commands for handling text data in files, including searching, sorting, and saving command outputs. These tools are invaluable for data analysis and scripting in Unix-like environments.
8.1 Basic Text Search
The grep command is a versatile tool for searching text within files:
Basic Search: Find lines containing a specific word in a file.
grep "myword" file.txt
Case Insensitive: Ignore case distinctions.
grep -i "myword" file.txt
Line Numbers: Show line numbers of matching lines.
grep -n "myword" file.txt
Inverse Search: Display lines that do not contain the specified word.
grep -v "myword" file.txt
Recursive Search: Search within a directory and all its subdirectories.
grep -r "myword" ./mydirectory/
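The options above can be combined. The following is a minimal sketch that builds a throwaway sample file (the file name and its contents are hypothetical) and runs a few of the searches against it. The -c flag, which prints a count of matching lines instead of the lines themselves, is not covered above and is included here only to make the result easy to check.

```shell
# Create a small sample file in a temporary directory (hypothetical content).
tmpdir=$(mktemp -d)
printf '%s\n' 'apple pie' 'Apple tart' 'banana bread' 'cherry apple' > "$tmpdir/fruits.txt"

# Case-sensitive search: prints "apple pie" and "cherry apple".
grep "apple" "$tmpdir/fruits.txt"

# Case-insensitive search with line numbers: lines 1, 2 and 4 match.
grep -in "apple" "$tmpdir/fruits.txt"

# Combine -i, -v and -c: count the lines that do NOT mention apples in any case.
non_apple=$(grep -ivc "apple" "$tmpdir/fruits.txt")
echo "Lines without apple: $non_apple"

# Remove the temporary directory.
rm -rf "$tmpdir"
```

Short options can be written together (-in) or separately (-i -n); both forms behave the same.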
8.2 Regular Expressions
Regular expressions enhance grep’s searching capabilities. Use the -E option for extended regex support (without it, grep treats ?, +, and | as literal characters):
- . : Matches any single character.
- ^ : Matches the start of a line.
- $ : Matches the end of a line.
- [ ] : Matches any character within the brackets.
- ? : The preceding element is optional.
- * : The preceding element can appear zero or more times.
- + : The preceding element must appear one or more times.
- | : Logical OR operator between expressions.
Example: Search for multiple words.
grep -E 'word1|word2|word3' myfile.txt
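To see how the anchors, the optional operator, and alternation interact, here is a small sketch run against a hypothetical sample file. It also uses parentheses for grouping, which extended regular expressions support but which are not listed above.

```shell
# Build a hypothetical sample file.
tmpdir=$(mktemp -d)
printf '%s\n' 'color' 'colour' 'colouur' 'error: disk full' \
  'no error here' 'warning: low memory' > "$tmpdir/sample.txt"

# "u?" makes the u optional; ^ and $ force the pattern to span the whole line,
# so "color" and "colour" match but "colouur" does not.
grep -E '^colou?r$' "$tmpdir/sample.txt"
spelling_count=$(grep -Ec '^colou?r$' "$tmpdir/sample.txt")

# Grouping plus alternation: lines that START with "error:" or "warning:".
# "no error here" is skipped because of the ^ anchor.
grep -E '^(error|warning):' "$tmpdir/sample.txt"
prefix_count=$(grep -Ec '^(error|warning):' "$tmpdir/sample.txt")

echo "spellings=$spelling_count prefixed=$prefix_count"
rm -rf "$tmpdir"
```

Single quotes around the pattern keep the shell from interpreting characters such as $, | and *.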
8.3 Sorting File Lines
Ascending Order: Default sorting.
sort myfile.txt
Descending Order: Reverse the sort order.
sort -r myfile.txt
Random Order: Shuffle lines. With GNU sort, -R orders lines by a random hash, so identical lines stay next to each other.
sort -R myfile.txt
Numeric Sort: Treat comparisons as numerical.
sort -n myfile.txt
Saving Output: Use -o to save the sorted result to a file.
sort -o sorted_file.txt myfile.txt
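The difference between the default sort and -n is easiest to see with numbers of different lengths. This sketch uses a hypothetical numbers.txt; the file names are assumptions made for illustration.

```shell
# Hypothetical input: four numbers, one per line.
tmpdir=$(mktemp -d)
printf '%s\n' 10 9 100 2 > "$tmpdir/numbers.txt"

# Default sort compares text character by character, so "10" comes before "2".
first_lexical=$(sort "$tmpdir/numbers.txt" | head -n 1)

# -n compares by numeric value, so 2 comes first.
first_numeric=$(sort -n "$tmpdir/numbers.txt" | head -n 1)

# -r and -n combined: largest value first.
largest=$(sort -rn "$tmpdir/numbers.txt" | head -n 1)

# -o writes the sorted result to a file instead of the terminal.
sort -n -o "$tmpdir/sorted.txt" "$tmpdir/numbers.txt"
cat "$tmpdir/sorted.txt"

echo "lexical first: $first_lexical, numeric first: $first_numeric, largest: $largest"
rm -rf "$tmpdir"
```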
8.4 Counting Text Elements
Basic Count: Displays line, word, and byte counts.
wc myfile.txt
Lines Only: Count the number of lines.
wc -l myfile.txt
Words Only: Count the number of words.
wc -w myfile.txt
Bytes Only: Count the number of bytes.
wc -c myfile.txt
Characters Only: Count the number of characters.
wc -m myfile.txt
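When wc is given a file name, it prints the name next to the count. In scripts it is often more convenient to feed the file through standard input so that only the number is printed. A minimal sketch, assuming a hypothetical notes.txt:

```shell
# Hypothetical two-line file: "one two" and "three".
tmpdir=$(mktemp -d)
printf 'one two\nthree\n' > "$tmpdir/notes.txt"

# Reading from standard input (<) makes wc print only the number.
line_count=$(wc -l < "$tmpdir/notes.txt")
word_count=$(wc -w < "$tmpdir/notes.txt")
byte_count=$(wc -c < "$tmpdir/notes.txt")

# $(( )) strips the padding that some wc implementations add around the number.
echo "lines=$((line_count)) words=$((word_count)) bytes=$((byte_count))"
rm -rf "$tmpdir"
```

The byte count includes the newline at the end of each line, which is why the two lines (7 and 5 visible characters) add up to 14 bytes.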
8.5 Removing Duplicates with uniq
Basic Usage: Filter out adjacent duplicate lines.
uniq myfile.txt
Saving Output: Redirect the output to a new file.
uniq myfile.txt > result.txt
Count Occurrences: Prefix lines by their occurrence counts.
uniq -c myfile.txt
Show Duplicates Only: Display only the repeated lines.
uniq -d myfile.txt
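Because uniq only compares neighbouring lines, it is usually run on sorted input. The sketch below, using a hypothetical colors.txt, shows the difference and builds a simple frequency table.

```shell
# Hypothetical input with repeated but non-adjacent lines.
tmpdir=$(mktemp -d)
printf '%s\n' red blue red green blue red > "$tmpdir/colors.txt"

# No two identical lines are adjacent, so uniq alone removes nothing.
unsorted_lines=$(uniq "$tmpdir/colors.txt" | wc -l)

# Sorting first places identical lines next to each other.
distinct_lines=$(sort "$tmpdir/colors.txt" | uniq | wc -l)

# -d on sorted input lists each line that occurs more than once.
repeated_lines=$(sort "$tmpdir/colors.txt" | uniq -d | wc -l)

# Frequency table, most common line first.
sort "$tmpdir/colors.txt" | uniq -c | sort -rn

echo "after uniq: $((unsorted_lines)), after sort+uniq: $((distinct_lines)), repeated: $((repeated_lines))"
rm -rf "$tmpdir"
```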
8.6 Extracting Columns with cut
For files with delimited columns, cut allows you to extract specific fields:
- Specify Delimiter: Use -d to define the column delimiter.
- Select Columns: -f selects the columns to extract.
Examples:
# Extract columns 1 to 3
cut -d ',' -f 1-3 myfile.txt
# Extract from column 3 onwards
cut -d ',' -f 3- myfile.txt
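Field lists can also name individual columns separated by commas. A small sketch on a hypothetical people.csv (the file and its columns are invented for illustration):

```shell
# Hypothetical comma-separated file with a header row.
tmpdir=$(mktemp -d)
printf '%s\n' 'name,city,age,team' 'Ana,Lisbon,31,blue' 'Ben,Oslo,27,red' > "$tmpdir/people.csv"

# Columns 1 and 3 only (name and age).
cut -d ',' -f 1,3 "$tmpdir/people.csv"
second_row=$(cut -d ',' -f 1,3 "$tmpdir/people.csv" | head -n 2 | tail -n 1)

# Column 2 onwards (everything except the name).
cut -d ',' -f 2- "$tmpdir/people.csv"
header_rest=$(cut -d ',' -f 2- "$tmpdir/people.csv" | head -n 1)

echo "row: $second_row, header: $header_rest"
rm -rf "$tmpdir"
```

cut splits on every delimiter it sees and has no notion of quoting, so a quoted CSV value that itself contains a comma would be split in the wrong place.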
8.7 Redirection and Pipes
Standard Output to File (>): Create or overwrite a file with the command output.
grep "myword" myfile.txt > result.txt
Append to File (>>): Add the command output to the end of an existing file.
grep "myword" myfile.txt >> result.txt
Standard Error to File (2>): Redirect error messages to a file.
grep "myword" myfile.txt 2> error.log
Combine Output and Errors (2>&1): Direct both standard output and errors to the same file.
grep "myword" myfile.txt > result.txt 2>&1
Pipes (|): Use the output of one command as input to another.
grep "myword" myfile.txt | sort
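The following sketch exercises each redirection operator in turn. All file names and contents are hypothetical, and missing.txt is deliberately never created so that grep has an error to report.

```shell
# Hypothetical input file.
tmpdir=$(mktemp -d)
printf '%s\n' 'myword beta' 'alpha' 'myword alpha' > "$tmpdir/myfile.txt"

# > creates (or overwrites) result.txt with the two "myword" lines.
grep "myword" "$tmpdir/myfile.txt" > "$tmpdir/result.txt"

# >> appends the two "alpha" lines, giving four lines in total.
grep "alpha" "$tmpdir/myfile.txt" >> "$tmpdir/result.txt"
result_lines=$(wc -l < "$tmpdir/result.txt")

# 2> captures the error message about the missing file.
# grep exits with a non-zero status here, so "|| true" keeps a strict script running.
grep "myword" "$tmpdir/missing.txt" 2> "$tmpdir/error.log" || true
error_bytes=$(wc -c < "$tmpdir/error.log")

# > file 2>&1 sends matches and the error message to the same file.
grep "myword" "$tmpdir/myfile.txt" "$tmpdir/missing.txt" > "$tmpdir/all.log" 2>&1 || true
all_lines=$(wc -l < "$tmpdir/all.log")

# A pipeline: filter, sort, then count.
match_count=$(grep "myword" "$tmpdir/myfile.txt" | sort | wc -l)

echo "result=$((result_lines)) errors=$((error_bytes)) bytes all=$((all_lines)) matches=$((match_count))"
rm -rf "$tmpdir"
```

The order of redirections matters: 2>&1 has to come after > file, because it points standard error at wherever standard output is going at that moment.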
8.8 Viewing File Contents
To display the contents of a file directly in the terminal:
cat myfile.txt
This command prints the entire content of myfile.txt to the screen.
8.9 Interactive Terminal Input
For interactive input, especially useful for commands like sort, you can use the here document syntax:
sort -n << END
After executing this command, type the lines you wish to sort, one per line (with -n they are compared as numbers). When you are done, type END on a line by itself to end the input and run the sort.
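The same syntax works inside a script, where the lines between the command and the closing delimiter take the place of typed input. A minimal sketch with made-up numbers; the output is written to a temporary file only so that it can be inspected afterwards.

```shell
# Temporary file to hold the sorted output.
tmpfile=$(mktemp)

# Everything up to the line containing only END is passed to sort as input.
sort -n << END > "$tmpfile"
42
7
19
END

# Prints 7, 19 and 42 on separate lines.
cat "$tmpfile"

smallest=$(head -n 1 "$tmpfile")
largest=$(tail -n 1 "$tmpfile")
rm -f "$tmpfile"
```

The closing delimiter has to appear alone on its line, with no spaces before or after it; the word END itself is arbitrary and any unused word can serve as the delimiter.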
8.10 Conclusion
These commands form the foundation of text processing and data manipulation in Unix-like systems, enabling efficient analysis and transformation of data.