Collections, Iterators, And Text Data¶
This module introduces collection-oriented programming in Rust. It focuses on vectors, iterator pipelines, hash maps, hash sets, and simple text-file processing. These are the tools that turn earlier examples from small scalar computations into programs that process data.
The examples in this module deliberately stay close to common scientific and technical workflows: read data, store values, transform collections, compute summaries, and count or classify tokens.
Learning Objectives¶
After completing this module, participants should be able to:
- Store homogeneous values in Vec<T>.
- Build vectors incrementally with push.
- Iterate over collections with iter, iter_mut, and into_iter.
- Use iterator adapters such as filter, map, zip, unzip, and enumerate.
- Use sum for simple reductions.
- Explain when copied is useful.
- Explain the role of collect and why type annotations are sometimes needed.
- Use HashMap to count values.
- Use HashSet to collect unique values.
- Read text input with buffered readers.
- Write text output with buffered writers.
- Process input byte by byte when that is appropriate for the file format.
Prerequisites¶
Participants should already be comfortable with:
- Ownership and borrowing.
- Shared and mutable references.
- Functions and closures.
- for loops.
Basic error handling with Result is useful, but not the main focus here.
The examples used in this module are:
Vectors¶
A Vec<T> stores a growable sequence of values of the same type. The
iterators example reads two numeric columns from a CSV file and stores them in
separate vectors:
cd source-code/iterators
cargo run -- --file data.txt
The relevant setup is:
let mut xs = Vec::new();
let mut ys = Vec::new();
Values are appended with push:
xs.push(value.x);
ys.push(value.y);
The vectors are mutable because reading the file grows them one record at a time.
Reading Structured Text With csv And serde¶
The iterators example uses the csv crate to read records and serde to
deserialize each record into a Rust struct:
use serde::Deserialize;
#[derive(Deserialize, Debug)]
struct Values {
    x: f64,
    y: f64,
}
The file is opened through the CSV reader:
let mut reader = csv::Reader::from_path(args.file)?;
Each record is deserialized in a loop:
for result in reader.deserialize() {
    let value: Values = result?;
    xs.push(value.x);
    ys.push(value.y);
}
The important pattern is that external text data is converted into typed Rust
values near the input boundary. The rest of the program can then work with
Vec<f64> rather than raw strings.
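The same boundary idea can be sketched with only the standard library. This is not the csv and serde code from the example: parse_columns is a hypothetical helper, and it assumes a header line followed by two comma-separated numeric fields per line. The header is skipped because csv::Reader treats the first record as a header by default.

```rust
// Hypothetical helper: parses "x,y" lines into two typed columns.
fn parse_columns(text: &str) -> Result<(Vec<f64>, Vec<f64>), String> {
    let mut xs = Vec::new();
    let mut ys = Vec::new();
    // skip(1) drops the assumed header line.
    for (index, line) in text.lines().enumerate().skip(1) {
        let line_number = index + 1;
        let (x, y) = line
            .split_once(',')
            .ok_or_else(|| format!("line {line_number}: expected two fields"))?;
        // Convert text to f64 here, so later code never sees raw strings.
        let x: f64 = x.trim().parse().map_err(|e| format!("line {line_number}: {e}"))?;
        let y: f64 = y.trim().parse().map_err(|e| format!("line {line_number}: {e}"))?;
        xs.push(x);
        ys.push(y);
    }
    Ok((xs, ys))
}

fn main() {
    let text = "x,y\n1.0,2.5\n12.0,4.0\n";
    let (xs, ys) = parse_columns(text).expect("sample data should parse");
    println!("xs = {xs:?}, ys = {ys:?}");
}
```

A malformed line produces an Err that names the line, rather than a value of the wrong type further down the pipeline.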
Borrowed Iteration And copied¶
Calling iter on a vector yields references to the elements:
xs.iter()
For a Vec<f64>, this produces items of type &f64. When the program wants
independent f64 values in the iterator pipeline, it can use copied:
let filtered_xs: Vec<f64> = xs
    .iter()
    .copied()
    .filter(|x| *x >= 10.0)
    .collect();
The copied adapter is appropriate here because f64 is a small scalar type
that implements Copy.
Without copied, the pipeline would operate on references. That is often fine,
but collecting owned scalar values is clearer for this example.
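The following sketch puts the borrowed and owned variants side by side, and adds iter_mut and into_iter for comparison. The function names are illustrative and do not come from the example program.

```rust
// With copied: the result owns plain f64 values.
fn keep_at_least(xs: &[f64], threshold: f64) -> Vec<f64> {
    xs.iter().copied().filter(|x| *x >= threshold).collect()
}

// Without copied: the result holds references that borrow from xs.
fn keep_at_least_borrowed(xs: &[f64], threshold: f64) -> Vec<&f64> {
    xs.iter().filter(|x| **x >= threshold).collect()
}

// iter_mut yields &mut f64, so elements can be updated in place.
fn double_in_place(xs: &mut [f64]) {
    for x in xs.iter_mut() {
        *x *= 2.0;
    }
}

fn main() {
    let mut xs = vec![4.0, 12.5, 10.0];
    println!("{:?}", keep_at_least(&xs, 10.0));
    println!("{:?}", keep_at_least_borrowed(&xs, 10.0));
    double_in_place(&mut xs);
    // into_iter takes ownership of the vector and yields f64 values directly.
    let total: f64 = xs.into_iter().sum();
    println!("{total}");
}
```

The borrowed version needs one more dereference in the predicate, and its result cannot outlive xs.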
Filtering Values¶
The filter adapter keeps only values that satisfy a predicate:
let filtered_xs: Vec<f64> = xs
    .iter()
    .copied()
    .filter(|x| *x >= 10.0)
    .collect();
The closure:
|x| *x >= 10.0
decides whether each value should be kept.
Iterator adapters are lazy. The pipeline does not produce the final vector
until collect is called.
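Laziness can be observed by counting how often the predicate runs. This is a small demonstration rather than part of the example program; observe_laziness is a hypothetical name, and the Cell counter exists only to make the calls visible.

```rust
use std::cell::Cell;

// Returns (predicate calls before collect, calls after collect, kept values).
fn observe_laziness(xs: &[f64], threshold: f64) -> (usize, usize, Vec<f64>) {
    let calls = Cell::new(0usize);
    // Building the pipeline does not run the closure.
    let pipeline = xs.iter().copied().filter(|x| {
        calls.set(calls.get() + 1);
        *x >= threshold
    });
    let before = calls.get();
    // collect drives the iterator, calling the predicate once per item.
    let kept: Vec<f64> = pipeline.collect();
    let after = calls.get();
    (before, after, kept)
}

fn main() {
    let (before, after, kept) = observe_laziness(&[4.0, 12.5, 10.0, 7.5], 10.0);
    // prints "before: 0, after: 4, kept: [12.5, 10.0]"
    println!("before: {before}, after: {after}, kept: {kept:?}");
}
```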
Mapping Values¶
The map adapter transforms each item:
let cubed_xs: Vec<f64> = xs
    .iter()
    .copied()
    .map(|x| x.powi(3))
    .collect();
This pipeline reads the x values, copies each scalar value, computes its
cube, and collects the results into a new vector.
This is often clearer than writing a manual loop when the computation is a straightforward element-wise transformation.
Collecting Results¶
The collect method is a consumer rather than an adapter: it drives the iterator to completion and builds a collection from the items:
let cubed_xs: Vec<f64> = xs
    .iter()
    .copied()
    .map(|x| x.powi(3))
    .collect();
The type annotation is important:
Vec<f64>
Rust can usually infer the iterator item type, but it often needs help knowing
which collection type to build. The same iterator could sometimes be collected
into a Vec, a HashSet, or another collection.
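As an illustration, the sketch below collects the same kind of iterator, the characters of a string, into three different targets. The function names are illustrative. Characters are used instead of f64 because HashSet requires hashable elements, which f64 is not.

```rust
use std::collections::HashSet;

// The return type tells collect which collection to build.
fn to_vec(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn to_set(s: &str) -> HashSet<char> {
    s.chars().collect()
}

fn reversed(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    let sequence = "GATTACA";
    println!("{:?}", to_vec(sequence));
    println!("{} unique symbols", to_set(sequence).len());
    println!("{}", reversed(sequence));
    // The turbofish form names the target on collect itself.
    let lengths = sequence.split('A').map(str::len).collect::<Vec<_>>();
    println!("{lengths:?}");
}
```

Removing the return type or the turbofish leaves collect without a target, and the compiler asks for a type annotation.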
Combining Iterators With zip¶
The zip adapter combines two iterators into one iterator over 2-tuples:
let filtered_pairs: Vec<(f64, f64)> = xs
    .iter()
    .copied()
    .zip(ys.iter().copied())
    .filter(|(x, _)| *x >= 10.0)
    .collect();
Here, the x and y columns are combined again after being stored in separate
vectors. The filter then keeps only 2-tuples whose x value is at least
10.0.
This is useful when two sequences represent related data and should be processed together.
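A pairwise computation is a typical use. The sketch below computes a dot product; dot is a hypothetical helper, not part of the example. It also shows a property worth knowing: zip stops as soon as either input is exhausted, so extra elements in the longer sequence are silently ignored.

```rust
// Dot product of two columns, pairing elements with zip.
fn dot(xs: &[f64], ys: &[f64]) -> f64 {
    xs.iter().zip(ys.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let xs = [1.0, 2.0, 3.0];
    let ys = [4.0, 5.0, 6.0];
    println!("{}", dot(&xs, &ys));
    // Only the first pair is used when ys has a single element.
    println!("{}", dot(&xs, &[10.0]));
}
```

If mismatched lengths indicate a data error, checking xs.len() == ys.len() before zipping is a reasonable safeguard.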
Splitting 2-Tuples With unzip¶
The inverse operation is unzip, which splits an iterator over 2-tuples into
two collections:
let (filtered_xs_unpacked, filtered_ys_unpacked): (Vec<f64>, Vec<f64>) =
    filtered_pairs
        .iter()
        .copied()
        .unzip();
The result type is written explicitly:
(Vec<f64>, Vec<f64>)
This tells Rust that the first components should be collected into one vector and the second components into another vector.
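When the intermediate vector of pairs is not needed, zip, filter, and unzip can be chained in one pipeline. The sketch below is one possible arrangement, with filter_columns as a hypothetical name. Here the function's return type plays the role of the explicit annotation.

```rust
// Keeps pairs whose x is at least threshold, then splits them into columns.
fn filter_columns(xs: &[f64], ys: &[f64], threshold: f64) -> (Vec<f64>, Vec<f64>) {
    xs.iter()
        .copied()
        .zip(ys.iter().copied())
        .filter(|(x, _)| *x >= threshold)
        .unzip()
}

fn main() {
    let (kept_xs, kept_ys) = filter_columns(&[4.0, 12.5, 10.0], &[1.0, 2.0, 3.0], 10.0);
    println!("{kept_xs:?} {kept_ys:?}");
}
```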
Reductions With sum, fold, And scan¶
Some iterator operations reduce many values to one value. The iterators
example computes the sum of the y values:
let sum_y: f64 = ys.iter().sum();
The type annotation tells Rust which numeric type the sum should produce.
For more general accumulation, use fold:
let sum_of_squares = xs
    .iter()
    .copied()
    .fold(0.0, |accumulator, x| accumulator + x * x);
fold carries an accumulator through the iterator and returns the final
accumulated value.
Use scan when the intermediate accumulated states are also part of the
result:
let cumulative_sum: Vec<f64> = xs
    .iter()
    .copied()
    .scan(0.0, |state, x| {
        *state += x;
        Some(*state)
    })
    .collect();
This produces the running sum after each input value. Conceptually, fold
returns only the final accumulated value, while scan yields the sequence of
accumulated states.
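Running both on the same small input makes the difference concrete. The two snippets above are wrapped in functions here so they can be called on sample data; the wrappers are illustrative only.

```rust
// fold: one final value.
fn sum_of_squares(xs: &[f64]) -> f64 {
    xs.iter().copied().fold(0.0, |accumulator, x| accumulator + x * x)
}

// scan: one value per input item.
fn cumulative_sum(xs: &[f64]) -> Vec<f64> {
    xs.iter()
        .copied()
        .scan(0.0, |state, x| {
            *state += x;
            Some(*state)
        })
        .collect()
}

fn main() {
    let xs = [1.0, 2.0, 3.0];
    println!("{}", sum_of_squares(&xs)); // prints "14"
    println!("{:?}", cumulative_sum(&xs)); // prints "[1.0, 3.0, 6.0]"
}
```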
Adding Indices With enumerate¶
The enumerate adapter attaches an index to each item:
let indexed_xs: Vec<(usize, f64)> = xs
    .iter()
    .copied()
    .enumerate()
    .collect();
This produces 2-tuples of the form:
(index, value)
It can also be used directly in a loop:
for (i, y) in ys.iter().enumerate() {
    println!("Index: {i}, y value: {y:.1}");
}
This is usually preferable to manually maintaining a separate counter.
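Indices are also useful inside a pipeline, for example to report where the largest value occurs. The sketch below is one way to do that, with argmax as a hypothetical helper. It uses f64::total_cmp for the comparison because f64 does not implement Ord.

```rust
// Returns the index and value of the largest element, or None for empty input.
fn argmax(ys: &[f64]) -> Option<(usize, f64)> {
    ys.iter()
        .copied()
        .enumerate()
        // Compare on the value, which is the second tuple component.
        .max_by(|a, b| a.1.total_cmp(&b.1))
}

fn main() {
    match argmax(&[2.0, 9.5, 3.0]) {
        Some((i, y)) => println!("largest value {y} at index {i}"),
        None => println!("no data"),
    }
}
```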
Hash Maps For Counting¶
A HashMap<K, V> stores values by key. The hashmap-hashset example uses a
hash map to count nucleotide characters:
cd source-code/hashmap-hashset
cargo run --bin count-nucleotides -- --file errors.txt
The count map is created with:
let mut counts = HashMap::new();
Each valid nucleotide updates its count:
*counts.entry(nucleotide).or_insert(0) += 1;
This pattern is common enough to read carefully:
- entry(nucleotide) selects the map entry for that key.
- or_insert(0) inserts 0 if the key was not present.
- *... += 1 increments the value stored in the map.
After processing the file, the program ensures that every valid nucleotide has an entry:
for nucleotide in VALID_NUCLEOTIDES {
    counts.entry(nucleotide).or_insert(0);
    println!("{nucleotide}: {}", counts[&nucleotide]);
}
This makes the output stable even if a nucleotide did not occur in the input.
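A self-contained version of the counting pattern might look like the sketch below. The definition of VALID_NUCLEOTIDES is not shown in this module, so the four-letter array here is an assumption, and count_nucleotides is a hypothetical wrapper that works on a string rather than a file.

```rust
use std::collections::HashMap;

// Assumed alphabet; the example's actual constant is not shown in this module.
const VALID_NUCLEOTIDES: [char; 4] = ['A', 'C', 'G', 'T'];

fn count_nucleotides(sequence: &str) -> HashMap<char, usize> {
    let mut counts = HashMap::new();
    for nucleotide in sequence.chars().filter(|c| VALID_NUCLEOTIDES.contains(c)) {
        *counts.entry(nucleotide).or_insert(0) += 1;
    }
    // Guarantee an entry for every valid nucleotide, including unseen ones.
    for nucleotide in VALID_NUCLEOTIDES {
        counts.entry(nucleotide).or_insert(0);
    }
    counts
}

fn main() {
    let counts = count_nucleotides("GATTACA");
    // Iterating over the constant, not the map, fixes the output order.
    for nucleotide in VALID_NUCLEOTIDES {
        println!("{nucleotide}: {}", counts[&nucleotide]);
    }
}
```

Indexing with counts[&nucleotide] panics on a missing key, which is why the second loop fills in zeros first.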
Hash Sets For Unique Values¶
A HashSet<T> stores unique values. The same example uses a hash set to record
which invalid tokens appeared in the input:
let mut error_tokens = HashSet::new();
When an invalid token is found, it is inserted:
error_tokens.insert(nucleotide);
If the same invalid token appears many times, the set still stores it once.
That makes HashSet a natural choice when the question is "which values were
seen?" rather than "how many times did each value occur?"
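The sketch below collects the distinct invalid characters of a string. It assumes that A, C, G, and T are the valid symbols, and unique_invalid is a hypothetical helper. Because a HashSet has no defined iteration order, the result is sorted before it is returned.

```rust
use std::collections::HashSet;

// Returns each invalid, non-whitespace character once, in sorted order.
fn unique_invalid(sequence: &str) -> Vec<char> {
    let mut error_tokens = HashSet::new();
    for token in sequence.chars() {
        if !matches!(token, 'A' | 'C' | 'G' | 'T') && !token.is_whitespace() {
            // insert returns false when the value was already present.
            error_tokens.insert(token);
        }
    }
    let mut sorted: Vec<char> = error_tokens.into_iter().collect();
    sorted.sort_unstable();
    sorted
}

fn main() {
    // 'X' occurs twice in the input but is reported once.
    println!("{:?}", unique_invalid("GAXTTXA?C\n"));
}
```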
Buffered Text Input¶
The nucleotide-counting example reads a text file through a buffered reader:
let file = std::fs::File::open(args.file)
    .expect("Failed to open the DNA sequence file");
let reader = BufReader::new(file);
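Buffered input reads the file in larger chunks and serves small reads from memory, instead of asking the operating system for one tiny piece of data at a time. This matters for larger files, and especially when the program iterates byte by byte.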
The example then iterates over bytes:
for byte in reader.bytes() {
    let nucleotide = byte.expect("Failed to read the DNA sequence file") as char;
    // process nucleotide
}
Byte-wise processing is appropriate here because the input is simple ASCII-like sequence data. For general Unicode text, line-based or string-based processing is usually more appropriate.
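The byte loop can be tried without a file, because a byte slice also implements Read. The sketch below is a variation on the example rather than a copy of it: count_symbols is a hypothetical helper, and it propagates read errors with ? instead of calling expect.

```rust
use std::io::{self, BufReader, Read};

// Counts non-whitespace bytes from any reader; a File works the same way.
fn count_symbols<R: Read>(input: R) -> io::Result<usize> {
    let reader = BufReader::new(input);
    let mut count = 0;
    for byte in reader.bytes() {
        // Each item is an io::Result<u8>; ? returns early on a read error.
        let symbol = byte? as char;
        if !symbol.is_whitespace() {
            count += 1;
        }
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // In-memory stand-in for the DNA sequence file.
    let data: &[u8] = b"GATT\nACA\n";
    println!("{}", count_symbols(data)?);
    Ok(())
}
```

The cast from u8 to char maps each byte to the code point with the same value, which matches the text only for ASCII input.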
Buffered Text Output¶
The data-generation and error-injection programs use buffered writers:
let file = std::fs::File::create(args.file).expect("Unable to create file");
let mut output = std::io::BufWriter::new(file);
Values are written with write! and writeln!:
write!(output, "{random_nucleotide}").expect("Unable to write file");
writeln!(output).expect("Unable to write file");
Buffered output is the counterpart to buffered input: it groups many small writes into fewer larger writes.
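The writer side can be sketched against an in-memory buffer, since a mutable reference to a Vec<u8> implements Write. write_wrapped is a hypothetical helper, not one of the example programs. It ends with an explicit flush: a BufWriter also flushes when dropped, but any error at that point is discarded.

```rust
use std::io::{self, BufWriter, Write};

// Writes the sequence to any writer, starting a new line every `width` symbols.
fn write_wrapped<W: Write>(sink: W, sequence: &str, width: usize) -> io::Result<()> {
    let mut output = BufWriter::new(sink);
    for (i, symbol) in sequence.chars().enumerate() {
        write!(output, "{symbol}")?;
        // A width of 0 disables wrapping and avoids a division by zero.
        if width > 0 && (i + 1) % width == 0 {
            writeln!(output)?;
        }
    }
    // Flush explicitly so that write errors are reported to the caller.
    output.flush()
}

fn main() -> io::Result<()> {
    let mut buffer = Vec::new();
    write_wrapped(&mut buffer, "GATTACA", 3)?;
    print!("{}", String::from_utf8_lossy(&buffer));
    Ok(())
}
```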
Matching While Processing Input¶
The nucleotide-counting example classifies each character with match and
match guards:
match nucleotide {
    nucleotide if is_valid_nucleotide(nucleotide) => {
        *counts.entry(nucleotide).or_insert(0) += 1;
    }
    nucleotide if nucleotide.is_whitespace() => {}
    _ => {
        error_tokens.insert(nucleotide);
    }
}
The cases are:
- valid nucleotide: increment its count;
- whitespace: ignore it;
- anything else: record it as an error token.
This combines pattern matching with collection updates.
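Putting the pieces together, the classification can be sketched as a function over a string. The body of is_valid_nucleotide is not shown in this module, so the version below is a stand-in that assumes the alphabet A, C, G, T; classify is likewise a hypothetical wrapper.

```rust
use std::collections::{HashMap, HashSet};

// Stand-in for the example's helper; the accepted alphabet is an assumption.
fn is_valid_nucleotide(c: char) -> bool {
    matches!(c, 'A' | 'C' | 'G' | 'T')
}

// Returns the counts of valid symbols and the set of invalid tokens.
fn classify(sequence: &str) -> (HashMap<char, usize>, HashSet<char>) {
    let mut counts = HashMap::new();
    let mut error_tokens = HashSet::new();
    for nucleotide in sequence.chars() {
        // Arms are tried top to bottom; the first guard that holds wins.
        match nucleotide {
            c if is_valid_nucleotide(c) => *counts.entry(c).or_insert(0) += 1,
            c if c.is_whitespace() => {}
            c => {
                error_tokens.insert(c);
            }
        }
    }
    (counts, error_tokens)
}

fn main() {
    let (counts, error_tokens) = classify("GAXT\nTA?X");
    println!("A: {}, invalid tokens: {}", counts[&'A'], error_tokens.len());
}
```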
Suggested Hands-On Work¶
Use this sequence as a practical lab.
- Run the iterator example:
  cd source-code/iterators
  cargo run -- --file data.txt
- Change the filter threshold from 10.0 to another value and inspect the output.
- Add a map pipeline that computes x.sqrt() for all non-negative x values.
- Add a fold expression that computes the sum of squares of the y values.
- Add a scan expression that computes the cumulative sum of the y values.
- Use zip to compute a vector of x + y values.
- Run the nucleotide-counting example:
  cd source-code/hashmap-hashset
  cargo run --bin count-nucleotides -- --file errors.txt
- Add a second HashMap that counts invalid tokens instead of storing only the unique invalid tokens.
- Change the output order by printing the contents of the map directly, then compare that with iterating over VALID_NUCLEOTIDES.
- Run the data generator and error injector to produce a new input file:
  cargo run --bin generate-data -- --count 200 --file data.txt
  cargo run --bin read-errors -- --file data.txt --output errors.txt --error-rate 0.2
  cargo run --bin count-nucleotides -- --file errors.txt
Discussion Points¶
This module is a good place to emphasize:
- Iterators describe a sequence of processing steps.
- Iterator adapters are lazy until consumed by collect, sum, a for loop, or another consuming operation.
- Type annotations on collect tell Rust what collection to build.
- copied is useful when moving from borrowed scalar values to owned scalar values.
- HashMap is useful for counts and lookup tables.
- HashSet is useful for uniqueness.
- Buffered I/O is a sensible default for file-based text processing.
- Choose byte-wise, line-wise, or record-wise processing based on the input format.
Connection To Later Modules¶
Collection and iterator patterns appear throughout larger Rust programs:
- Project-organization examples reuse collection-processing code from multiple binaries.
- Error-handling examples make file and parse failures explicit.
- Randomness examples generate collections of synthetic data.
- Julia set examples fill matrix-like storage with computed values.
- The N-body example iterates over particles, forces, diagnostics, and output records.
Once participants are comfortable processing collections and text data, they are ready to study error handling and then the project organization needed for larger examples.