Using Shell Scripts for Sequence Processing
Overview
This guide covers how to use shell scripts to run sequence processing programs efficiently.
Basic Shell Script Structure
Shell scripts provide a way to automate sequence processing workflows. Here’s a basic template:
#!/bin/bash
# Script: process_sequences.sh
# Description: Process sequencing data
# Set error handling
set -e # Exit on error
set -u # Exit on undefined variable
set -o pipefail # Exit on pipe failure
# Define input/output paths
INPUT_DIR="data/raw"
OUTPUT_DIR="data/processed"
# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"
echo "Starting sequence processing..."
# Your processing commands hereRunning Programs in Shell Scripts
Basic Command Execution
# Run a single command
program_name input.fastq > output.txt
# Run with multiple parameters
program_name --input input.fastq --output output.bam --threads 4Error Checking
# Check if command succeeded
if ! program_name input.fastq > output.txt; then
echo "Error: Processing failed" >&2
exit 1
fi
# Alternative using set -e (recommended)
set -e
program_name input.fastq > output.txtProcessing Multiple Files
# Loop through all FASTQ files
for file in "$INPUT_DIR"/*.fastq; do
basename=$(basename "$file" .fastq)
echo "Processing $basename..."
program_name "$file" > "$OUTPUT_DIR/${basename}.processed.txt"
doneBest Practices
- Always use error handling with
set -e - Quote variables to handle spaces in filenames:
"$variable" - Log output to track progress and debug issues
- Use meaningful variable names for clarity
- Comment your code to explain complex operations
- Test on small datasets before running on full data
Example: Complete Processing Pipeline
#!/bin/bash
set -euo pipefail
# Configuration
INPUT_DIR="data/raw"
OUTPUT_DIR="data/processed"
THREADS=8
LOG_FILE="processing.log"
# Create directories
mkdir -p "$OUTPUT_DIR"
# Start logging
exec > >(tee "$LOG_FILE") 2>&1
echo "Pipeline started: $(date)"
# Process each sample
for fastq in "$INPUT_DIR"/*.fastq.gz; do
sample=$(basename "$fastq" .fastq.gz)
echo "Processing sample: $sample"
# Quality control
fastqc "$fastq" -o "$OUTPUT_DIR"
# Alignment
bwa mem -t "$THREADS" reference.fa "$fastq" | \
samtools sort -@ "$THREADS" -o "$OUTPUT_DIR/${sample}.bam"
# Index
samtools index "$OUTPUT_DIR/${sample}.bam"
done
echo "Pipeline completed: $(date)"Further Resources
- Bash scripting guide
- Shell script best practices
- Workflow management systems (Snakemake, Nextflow)