Lecture notes
Introduction
Workshop goals
- This workshop is meant for people who are already reasonably comfortable with programming in Python (but not necessarily experts).
- “I want to change my code from Python to C++ so it runs faster”
- There are excellent solutions to accelerate performance in Python (Numba, Cython, PyPy, Mojo?).
- It may be easier to incorporate one of these than to learn a whole new language (but it may require a high level of expertise).
- Naïve C++ code is generally faster though.
- We’re not going to talk about profiling and optimization.
- We’re going to learn state-of-the-art C++. In the “wild” things could be different:
- Joining an existing project, you may find yourself constrained by what features you may use.
- E.g. you may have to use a compiler that doesn’t implement the features you want.
- Other than the C++ language there is a rich ecosystem that you
should eventually learn about:
- compilers
- static & dynamic analysis tools
- debugging, profiling, unit testing
- external libraries
- build systems
- package managers
- editors and IDEs
- Setting up a local C++ work environment can be different for
everybody.
- You can look up online how to set it up on your particular computer.
- For uniformity, we’ll use the SciNet Teach cluster (or Niagara).
- You can edit files on the cluster using vim or nano.
- Or you can edit files locally on your machine, and copy them to the cluster.
History of C++
- 1980 to 1985: “C with Classes”
- Development started by Bjarne Stroustrup in the early 1980s.
- The goal was to add useful features (primarily object-oriented) to the C language.
- Influenced by Simula, a language from the 1960s.
- Python is actually not much younger than C++, first introduced in 1991.
- 1985 to 1998: “Pre-standard C++”
- Starting with the commercial release of the language in 1985.
- It took 4–5 years for early C++ to stabilize.
- Then about another decade to really figure out what the language was good for and how to use it and support it.
- Around 1994 we see the introduction of the Standard Template Library (also known as the STL).
- The STL (now just the standard library) made quite a big impact on the direction of the language, and Stroustrup credits it with actually saving the language.
- 1998 to 2011: “Old standard C++”
- From the late 1990s there were few new features, but the language was finally standardized, and work started on what would become C++11.
- Since 2011: “Modern C++”
- Since 2011 we are again in an era of relatively rapid change, in which C++ is being refined and improved, partly in response to changes in hardware. For example, threading and concurrency are now part of the language.
The C++ standards are analogous to Python versions. Features are added in new standards (and, rarely, obscure features are removed). The difference from Python is that Python has a reference implementation (CPython), which is the most popular one, and its version number corresponds to the version of the language. In the C++ world, compiler versions don’t correspond to language standards. Compiler vendors choose which features to implement in which version, so compilers are usually somewhat behind the latest standard.
The standard consists of the core language and the standard library. “Core language” refers to the basic syntax, but the standard also describes a set of libraries that are an integral part of C++.
C++ is a (mostly) backward-compatible language. That means there is a lot of historical baggage (or “technical debt”) in the language, i.e. concepts and idioms that are obsolete but still valid. You will encounter them in the wild, but be careful not to adopt them into your coding style.
The current standard of the language is called C++23.
A few words about C
- C is a different language to C++.
- C knowledge is not needed to learn C++.
- C and C++ are completely different in terms of philosophy.
- C code is (almost always) valid C++ code.
- C code is (almost always) bad C++ code.
- C has its place, but it is almost certainly not what you’re looking
for.
- C doesn’t have a lot of abstractions that are useful in scientific programming.
- C is more difficult than C++ to code in (opinion).
- C is the preferred language for many libraries (or their APIs),
because
- C has a very stable ABI.
- C APIs can be accessed from C++ and most other languages.
- C# is also something different.
“If you think like a computer, writing C actually makes sense.”
– Linus Torvalds
Is C++ the best language?
No. Because there is no such thing. There are several aspects to consider:
- Does it have an easy and elegant syntax (short programs)?
- Is it easy to learn?
- Does it produce “safe” programs, or are they error prone?
- Does it produce fast programs?
- Does it have a large ecosystem?
- Does it have a large and helpful community?
C++ is probably not the best with respect to the first two points, not the best in terms of the third point but much better than people think, and very good in terms of the last three points.
A quick comparison with Python
Similarities
- Both are imperative programming languages, meaning that the program flows from one statement to the next.
- Similar constructs:
- variables: case sensitive, variable names follow the same rules: can’t start with a number, can’t have symbols like plus or minus…
- control flow statements: if, for, while… The syntax is only slightly different; the two languages even use the same keywords, continue and break, for early exit from loops.
- types: in both languages variables have types, and we can create our own types. That is, both strongly support object-oriented programming.
- functions (free, including lambdas, or methods in a class).
- zero-based indexing.
Differences
Syntactic differences
(relatively minor and not so important)
- Statements in C++ end in a semicolon (newline is meaningless).
- Blocks† are delimited by curly braces.
- Parentheses are needed after if, for, and while.
- Comments start with // and end with a new line, or are in a block like this: /* ... */.
- …
† For readability, C++ code should be indented the same way as Python, it’s just not enforced by the language.
Implementation
(usually) C++ is a compiled language, while Python is interpreted. This is a bit of an oversimplification, but the bottom line is:
- In Python, the program is one or more source files. To execute the program, they are read by a program called an interpreter which issues the instructions to the CPU.
- In C++ the source files are compiled (“ahead of time”) to form an executable file (or library), that has the instructions to the CPU in machine language.
The consequence of this is that the compiler can heavily optimize the machine instructions that result from C++ code, which are then run on “bare metal” (kinda) rather than through the interpreter, which is generally (much) faster.
That being said, Python code is rarely pure Python. Packages like NumPy and SciPy are actually written in compiled languages but have Python interfaces, so code that heavily uses such packages may not gain much speedup by being rewritten in C++. Accelerating Python code is an art in itself, and transitioning to C++ is just one possible aspect of that.
Another consequence is that erroneous coding in C++ can lead to compile-time errors. It’s often very useful to catch errors at this stage, rather than at runtime. The compiler can also detect some issues that may cause runtime errors, and issue warnings about them.
Type system
C++ is more “strict” with regard to the types of variables. Unlike Python, a variable is associated with just one† type, and a function’s parameter types and return type are also determined in advance††.
† Cf. std::variant and std::any for the very uncommon scenarios where a single variable may need to hold values of different types at different times.
†† With templates in C++ we can write a function without really specifying types. That is actually very common, and among other things helps prevent code duplication, but we won’t touch on that in this workshop.
C++ theory
A “Hello, World!” program
Setting up environment on the Teach cluster
Use an SSH client program to connect to the Teach cluster (teach.scinet.utoronto.ca) using your credentials.
Type the following commands to get the course material:
cd $SCRATCH
tar xf /scinet/course/scmp241/py2cpp.tar.bz2
cd py2cpp
👉 The source file can also be downloaded from here: https://scinet.courses/mod/resource/view.php?id=2948
Activate the environment:
source activate
Optionally split your screen by typing tmux, then press Ctrl+B followed by " (quotation mark).
- You can toggle between the panes using Ctrl+B followed by ; (semicolon).
Create a file called hello.cpp in the current folder, and type this content:
#include <print>
int main() {
    std::println("Hello, World!");
}
Switch to the other pane (or exit the editor) and build the program with: g++ hello.cpp
Now run the program with: ./a.out
👉 We can choose a different name for the output instead of
a.out, for example if we want the executable file to be called hello, the command would be g++ -o hello hello.cpp.
Notes about the code
- Include directive. You can think of it like import in Python; in this case we “imported” std::println.
- In practice it is quite different, we actually included a header file.
- Header files contain (mostly) declarations of the functions (etc.) available in some library.
- Declaration-implementation separation is important in multi-file projects.
- It is being phased out in favour of modules (will use the import keyword).
- Main function (we’ll talk more about it later).
- Note curly braces and semicolon.
- The :: symbol just means that there is a function called println in a namespace called std. For now just think about std::println as the name of a function.
- Standard library functions, classes, and objects exist in this std namespace.
- String constructed with double quotes only (single quotes used for a single character).
About printing
std::println is a brand new part of the C++ standard, and not widely available in compilers as of 2023.
std::println always prints a new line at the end; use std::print if that is not desired.
A more traditional hello-world program looks like this:
#include <iostream>
int main() {
    std::cout << "Hello, World!\n";
}
The << operator is used to “insert” the string to the cout stream; in other words just print it. We’ll talk more about streams at the end.
Here, we have to specify the newline character (\n) explicitly if we need it.
📖 Extra information
This is a more detailed explanation of what is going on when running the g++ command. It is not critical to understand at this point.
- g++ is the invocation command for the GNU Compiler Collection (also known as GCC).
- There are many other compilers. GCC is reliable and commonly used.
- What g++ does is quite complex, technically “compilation” is just one of several steps.
- In a project with multiple .cpp files, the files are compiled separately and linked together into one executable at the end.
- The .cpp files are called compilation units.
- There are many parameters we may need to pass to g++, including where to find external libraries and, importantly, optimization flags.
- Most projects use some build automation software to manage the build process.
- Even for small projects it’s recommended to create a Makefile, a script used by the GNU Make tool.
- When the build is complex (especially when there are external dependencies), CMake can be used to create the Makefile.
- That’s outside the scope of the workshop, but very important in the real world.
About portability
- A Python program will run everywhere where an interpreter and required packages are installed.
- C++ source code will compile and run everywhere where a compiler and required libraries are installed.
- The C++ executable though will not necessarily run on systems other
than the one it was built for.
- It contains machine language instructions that are specific to (usually) one CPU architecture.
- The executable format and syscalls are operating system-specific.
- The executable is usually dynamically linked to some libraries, and their presence in the system running the application is required.
The bottom line is that the executable is not very portable.
Variables
Unlike in Python, variables have to first be declared (with a type) before a value is assigned. They can (and should) be simultaneously initialized. For example (“copy initialization”):
int a = 5;
There are many other styles to do the same thing. Bjarne Stroustrup’s favourite way is “list initialization”:
int a {5};
It is almost† equivalent to the first style for
int
.
The auto
keyword can be used instead of the type:
auto a {5};
auto a = 5;
In this context, this keyword means “the type is whatever the type is on the right-hand side”, and should be used when we don’t exactly know or care what is on the right-hand side.
The variable declared like so still has a type (int in this
case), it’s just deduced by the compiler. Where and how often to use
auto
is a personal choice.
⚠️ Don’t declare variables too far from where they are needed in the code.
† The difference is that list initializing disallows narrowing conversion.
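The initialization styles above can be put side by side in a minimal sketch. The function name init_styles_demo is hypothetical and exists only for illustration; the static_assert lines check at compile time which type auto deduced.

```cpp
#include <type_traits>

// Hypothetical helper for illustration: declares variables in the styles
// discussed above and returns a value computed from them.
int init_styles_demo() {
    int a = 5;      // copy initialization
    int b {7};      // list initialization
    auto c = 9;     // type deduced as int from the literal
    auto d = 2.5;   // type deduced as double from the literal

    // Compile-time checks of the deduced types.
    static_assert(std::is_same_v<decltype(c), int>);
    static_assert(std::is_same_v<decltype(d), double>);

    // int e {2.5};  // would not compile: narrowing conversion

    return a + b + c + static_cast<int>(d);  // 5 + 7 + 9 + 2
}
```

Uncommenting the line that declares e should make the compiler reject the program, which is the difference the footnote refers to.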
Nature of variables in Python and C++
In short, Python variables are references to objects, C++ variables are the objects.
In Python, the statement x = 1234
creates an object
somewhere in memory and the label x
refers to that object.
When we later modify the value (e.g. x = 5678
or
x = 'hello'
), another object is created somewhere else in
memory, and the label x
is reassigned to refer to that new
object. The original object (holding the value 1234) is not necessarily
immediately destroyed and its memory freed, but is garbage
collected eventually.
In C++, the statement int x = 1234;
(or its equivalents)
creates an object somewhere in memory. In contrast with Python,
modifying x
actually modifies the value in the memory
location where 1234 is stored; that is partly why you can’t use
x
for values of another type. The integer object is
immediately destroyed and the memory freed when x
goes
out of scope.
You can also think about it as a different behaviour of the
=
operator in the two languages.
This difference will come up again when we talk about how to pass parameters to functions.
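A small sketch of what “C++ variables are the objects” means in practice, using a std::vector (introduced later in these notes). The function name copy_is_independent is made up for this illustration. In Python, the analogous y = x followed by y[0] = 99 would change the list seen through both names.

```cpp
#include <vector>

// Hypothetical demo: returns true if modifying a copy leaves the
// original object untouched.
bool copy_is_independent() {
    std::vector<int> x {1, 2, 3};
    std::vector<int> y = x;   // y is a separate object holding a copy of x's elements
    y.at(0) = 99;             // changes y only
    return x.at(0) == 1 && y.at(0) == 99;
}
```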
Scoping rules
In Python, resolution of what x
actually refers to in
any specific context follows the “LEGB” rule, and can get quite
confusing.
Scoping rules for C++ are very simple: a variable x
is
accessible after it is defined (it can only be defined once), in the
same scope and all inner scopes. The name x
can be reused
for a different variable in an inner scope (that is called
shadowing and is a bad practice).
For example, this is not legal:
{
int a = 5;
float a = 1.234;
}
But this is fine:
{
int a = 5;
}
{
float a = 1.234;
}
Types
Similar to Python, we have
- “Primitive” types: int, char, bool, float, & double (sometimes called fundamental types).
- int is an integer type; unlike in Python, it has a fixed range.
- The range is actually platform-dependent†, but almost always ranges between approx. –2 and +2 billion (4 bytes or 32 bits).
- The range can be adjusted by using the unsigned/signed and/or short/long type modifier keywords.
- char is a numeric type, typically with the range -128 to +127 (1 byte).
- It can be marked unsigned to range between 0 and 255.
- It’s used also to store a single ASCII character.
- float and double are 32- and 64-bit IEEE-754 floating point types.
- Python’s float is usually 64 bit.
- Many systems support long double. It may be 128 bits wide but that’s implementation-dependent.
- User-defined types (classes, including standard library classes).
- We’ll see standard library containers later which are important classes.
👉 Like in Python, classes are types, and the terms may be used interchangeably.
† To specify exactly what width of integer is needed, one can use fixed width integer types.
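As a rough illustration of the fixed width integer types mentioned in the footnote, the sketch below queries their limits with std::numeric_limits. The variable names max_i32 and max_u8 are chosen here for illustration only.

```cpp
#include <cstdint>
#include <limits>

// Largest value of a 32-bit signed integer: 2147483647 (about +2 billion).
constexpr std::int32_t max_i32 = std::numeric_limits<std::int32_t>::max();

// Unsigned 8-bit integers range from 0 to 255.
constexpr std::uint8_t max_u8 = std::numeric_limits<std::uint8_t>::max();

// sizeof reports the width in bytes; these checks run at compile time.
static_assert(sizeof(std::int32_t) == 4);
static_assert(sizeof(std::int64_t) == 8);
```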
Type casting
⚠️ the following is legal but doesn’t work the same way as in Python:
{
int a = 5;
a = 1.234;
}
The value of a
at the end of the block is 1! Since it
can only hold an integer, the value 1.234 is truncated (a narrowing
conversion). By default GCC won’t complain if we do that, but we
can pass the -Wconversion
flag to enable warnings. If we
use list initialization (a = {1.234};
) we’ll get an
error.
What happens if we try to assign a string literal to
int a
? Casting is not possible in this case.
As a side note, conversion from int to float is also narrowing, because integers whose magnitude exceeds 2^24 (about 16.7 million) may not be represented exactly by the 32-bit float type. Conversion from a 32-bit int to double, however, loses no precision.
The “safe” way to cast from one type to another is using
static_cast
. In case the conversion is narrowing, it is a
way to promise the compiler that we know what we’re doing, so it doesn’t
raise errors or warnings. It could be useful to explicitly convert
types, for example, when we divide one integer by another, it results in
integer division like in old Python 2 (or the //
operator
in Python 3). To get around that, one of the numbers has to be cast
(e.g. to a double).
auto a {3};
auto b {2};
auto c = a/b; // c is an integer with the value 1
auto d = static_cast<double>(a)/b; // d is a double with the value 1.5
const and constexpr
Variables of any type in C++ can be made read-only with the
const
keyword, and some types can also be
constexpr
.
constexpr
is “stronger” because it implies that the
value is not only unchangeable, but also known at compile time. It can
still be a calculated value, but the result of the calculation has to be
known at compile time.
Whenever we create a variable that is not expected to change, it’s
strongly recommended to mark it constexpr
or
const
(if not calculable at compile time).
This is important to help the compiler optimize the code, and may also prevent mistakes.
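A minimal sketch of the difference between the two keywords. The names square, side, area, and doubled are hypothetical and only serve the illustration.

```cpp
// A constexpr function can be evaluated at compile time when it is
// called with arguments that are themselves compile-time constants.
constexpr int square(int x) { return x * x; }

constexpr int side = 4;              // known at compile time
constexpr int area = square(side);   // calculated, but still at compile time
static_assert(area == 16);

// The value of n is only known at runtime, so the result can be
// const but not constexpr.
int doubled(int n) {
    const int result = 2 * n;
    // result = 0;  // would not compile: result is read-only
    return result;
}
```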
Functions
See example:
examples/02_functions.cpp
- A function has a signature: one return type and zero or
more input types.
- In Python it is also a good idea to declare input and output types!
- In C++ it is perfectly fine to have functions with the same name but different signatures. That is called function overloading.
- The return type can be “void” for a function that doesn’t actually return anything.
- Default parameters (like in Python, have to come at the end).
- If we want a function to return more than one number (or any other
object), we can make the return type a tuple.
- Notice the if-statement in the tuple example: if the body of a true/false clause or a loop has only one statement, the curly braces are optional.
- Input parameters are passed by value (copy) by default, which can be very bad. More later about how to pass by reference.
- Functions can be “anonymous” and may be used in place or assigned to variables, these are lambdas. We’ll see them later.
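Some of the points above can be sketched in a few lines. This is not the content of examples/02_functions.cpp; the function names twice, power, and divide are invented here for illustration.

```cpp
#include <string>
#include <tuple>

// Function overloading: same name, different parameter types.
int twice(int x) { return 2 * x; }
std::string twice(const std::string& s) { return s + s; }

// Default parameter: exponent is 2 unless the caller specifies it.
int power(int base, int exponent = 2) {
    int result = 1;
    for (int i = 0; i < exponent; ++i)
        result *= base;   // single-statement body, so braces are optional
    return result;
}

// Returning two values at once (quotient and remainder) as a tuple.
std::tuple<int, int> divide(int a, int b) {
    return {a / b, a % b};
}
```

A caller could unpack the tuple with a structured binding, e.g. auto [quotient, remainder] = divide(7, 2);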
⚠️ If a function grows too big, has too many nesting levels, or too many parameters, it should probably be split into multiple smaller functions.
The main() function
- As we’ve seen, this is the entry point to the program.
- If we create an executable, we have to have main, otherwise (library) we shouldn’t have a function by that name.
- The return type should always be int, and upon successful termination it should return zero (and an error code otherwise).
- Unlike all other non-void functions, it’s OK to not have a return statement, in which case the exit code is zero.
- main can have arguments, which is how command line parameters can be passed to the program.
Templates
This is an important and big topic in C++ but we won’t touch on it here beyond these notes. This idea doesn’t even exist in Python because Python programming is generic by design.
- We can make a function generic by using auto as a placeholder for the type in the declaration.
- This makes this function a template.
- Templates are compiled as needed, as opposed to normal functions that are always compiled.
- Meaning when it is used in the code (instantiated), the compiler sees what the parameter types actually are and creates a specialization of the template.
To use (or instantiate) a template, if the types cannot be automatically deduced, they have to be specified in angle brackets <> as we will see in the examples.
📖 Extra information
The idea of templates goes well beyond a placeholder type.
- Classes can also be templates.
- Using auto like so is the abbreviated function template syntax.
- There is also a “full” syntax for template declaration.
- Templates should be used with concepts to reduce errors and increase readability.
- Template meta-programming can get really complicated.
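For reference, here is a small sketch of the “full” template syntax, which also compiles with pre-C++20 compilers. The function name largest is hypothetical. The abbreviated form mentioned above requires C++20.

```cpp
#include <string>

// Full template syntax: T is a placeholder for whatever type the caller uses.
// The C++20 abbreviated form `auto largest(auto a, auto b)` is similar,
// except that a and b may then have different types.
template <typename T>
T largest(T a, T b) {
    return (a < b) ? b : a;
}
```

A call such as largest(3, 7) lets the compiler deduce T as int. In largest<double>(3, 7.5) the two arguments have different types, so T cannot be deduced and is specified in angle brackets.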
Standard containers
The C++ standard library provides some useful containers.
std::vector
This is the most important container, similar to a Python list, but all elements have the same type (so more similar to a NumPy 1d array). The type has to be indicated as a template parameter if it can’t be deduced. This container can be used as a stack (i.e. push and pop), and has random access.
See example:
examples/03_vector.cpp
- To use a vector we must #include the <vector> header.
- The std::vector class has multiple constructors.
- Template arguments are sometimes needed to specify the value type.
- The class has many useful methods: size, push_back, empty, and at.
- Elements are accessed with the at method.
For a full list of constructors and other methods, see here.
⚠️ Accessing elements with square brackets []
like in
Python is possible but not recommended because there are no bounds
checks.
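The sketch below shows push_back and the bounds check performed by at. It is not taken from examples/03_vector.cpp; the function names make_squares and at_is_checked are hypothetical.

```cpp
#include <stdexcept>
#include <vector>

// Builds the vector {0, 1, 4, ...} holding the first n squares.
std::vector<int> make_squares(int n) {
    std::vector<int> squares;          // starts empty
    for (int i = 0; i < n; ++i)
        squares.push_back(i * i);      // grows by one element each time
    return squares;
}

// Returns true if at() rejects an out-of-range index.
bool at_is_checked() {
    std::vector<int> v {1, 2, 3};
    try {
        static_cast<void>(v.at(10));   // index 10 does not exist
    } catch (const std::out_of_range&) {
        return true;                   // at() throws instead of reading invalid memory
    }
    return false;
}
```

With v[10] instead of v.at(10) there would be no exception; the behaviour would simply be undefined.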
Other containers
- std::array is the same as std::vector but with fixed size that is known at compile-time.
- std::unordered_map is the equivalent of a Python dictionary.
- There is also std::map, but the unordered version is usually what you want (performance differs).
- std::unordered_set is the equivalent of a Python set.
- std::valarray is a bit old-fashioned but not deprecated, similar to a NumPy array in that it supports element-wise mathematical operations, slicing, and reductions. You can do all these with std::vector but need to manually define these operations. These are not as powerful as using a linear algebra library like Eigen or Armadillo.
- There are many other containers, but the above cover almost all use cases.
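A short sketch of two of these containers. The function name count_chars and the variable primes are made up for illustration. One difference from a Python dictionary worth noting: accessing a missing key with square brackets inserts it with a default value (zero for int) rather than raising an error.

```cpp
#include <array>
#include <string>
#include <unordered_map>

// Hypothetical helper: counts how often each character occurs in a string.
std::unordered_map<char, int> count_chars(std::string text) {
    std::unordered_map<char, int> counts;
    for (char c : text)
        counts[c]++;   // a missing key is first inserted with the value 0
    return counts;
}

// The size of a std::array is part of its type and known at compile time.
constexpr std::array<int, 3> primes {2, 3, 5};
static_assert(primes.size() == 3);
```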
References (and pointers)
See example:
examples/04_references.cpp
- References can be used to create an alias variable.
- The real power is that if a function has a reference type in its signature, the variable is passed to it by reference, so copying is avoided.
- The function my_func_by_value actually doesn’t do anything.
- The function my_func_by_reference successfully mutates a in place.
- In this case, the parameter is called an in/out parameter.
- Mutating an in/out parameter is called a “side effect” of the function.
- Functions without side effects are called “pure”.
- Pure functions are easier for the compiler to optimize and for humans to understand.
- If all the function needs to do is to mutate an int or something like that, it is better to keep it a pure function by just passing by value and returning the result.
- If the goal is to mutate a few ints, prefer to return a tuple or a struct.
Parameters of “small” types (e.g. int, even double) can and should be passed by value, but anything bigger (e.g. standard containers) should be passed by reference, and very often by const reference if it does not need to be modified.
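The three ways of passing a parameter can be sketched as follows. These are not the functions from examples/04_references.cpp; the names increment_copy, increment_in_place, and first_plus_last are hypothetical.

```cpp
#include <vector>

// Receives a copy: the caller's variable is untouched.
void increment_copy(int a) { a += 1; }

// Receives a reference: the caller's variable is modified in place.
void increment_in_place(int& a) { a += 1; }

// const reference: no copy is made, and the function cannot modify the vector.
// Assumes v is not empty.
int first_plus_last(const std::vector<int>& v) {
    return v.front() + v.back();
}
```

The only difference between the first two functions is the & in the signature; the call sites look identical.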
Passing by reference in Python 🐍
In Python there is not much choice, everything is passed by reference. But remember that the assignment operator = reassigns the label. So if we have a function like:
def my_func_assignment(param):
    param = param * 2
It will not change the original a. It will create a new object with the value param * 2 and assign it to a local variable param.
There is a subtle difference between param = param * 2
and param *= 2
though. For int and primitive types like
that, it’s the same. But for classes, the augmented assignment operators
(such as +=
, *=
, …) can be overloaded in such
a way that they mutate the object. See
examples/04_references.py
for an example where an input
parameter can be mutated if it is a class object.
Similarly, calling methods on an input variable of some class type may mutate the variable.
About pointers
- Pointers are variables that hold memory addresses.
- Of other variables, or elements in a container.
- May point to manually managed memory.
- They are very rarely needed in modern C++ proper, because:
- References provide a very similar functionality.
- With standard containers there is usually no need to manually manage memory.
- Improper use of pointers leads to memory bugs like leaks and access
violations (segmentation fault).
- References are almost always safe (the main exception being a dangling reference, e.g. when a function returns a reference to a variable created in its scope).
- Pointers are mostly useful when interacting with or wrapping a C
library (since C has no reference types).
- In C, using pointers is necessary in complex programs.
- In situations where pointers are really needed, you should use smart pointers.
- In Python you can get the memory address of a variable with the id builtin (in CPython), but that’s about as far as Python goes in supporting pointers.
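For the rare cases where a pointer is really needed, a smart pointer might look like the sketch below. The function name smart_pointer_demo is hypothetical; std::unique_ptr and std::make_unique come from the <memory> header.

```cpp
#include <memory>

// Hypothetical demo of a smart pointer that owns a single int.
int smart_pointer_demo() {
    auto p = std::make_unique<int>(41);  // allocates an int and takes ownership of it
    *p += 1;                             // the * operator accesses the pointed-to value
    return *p;                           // the memory is released automatically when p goes out of scope
}
```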
Loops
See example:
examples/05_loops.cpp
- Range-based for loops are the “workhorse” of C++.
- Inside the parentheses, on the left is the range declaration. It’s similar to a variable declaration.
- On the right after the colon is the range expression. In the first example it is just the container. In the second, it is the container modified by a range adaptor.
- C-style loops just increment an index variable until some condition is met. The index appears three times in the loop’s header and it’s surprisingly easy to mess it up.
- Notice the increment operator ++, that is where the language gets its name!
- The meaning of idx++; is exactly the same as idx += 1;, which is also valid in C++.
- It doesn’t have to be idx++ on the right, you can decrement with the -- operator, or use a custom stride.
- std::views::iota is similar to range in Python, it is lazily evaluated (a “range factory” in C++ terminology).
- Both arguments are needed to create a finite range!! std::views::iota(5) is an infinite series starting with 5.
- iota is quite limited, can only go forward in increments of one.
- The index can be declared const.
- for_each is an algorithm from the standard library that executes some function for each element in the container.
- It may or may not modify the element.
- We used a lambda function as the second argument.
- There are many more algorithms! We’ll see some of them later.
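A compact sketch of the loop styles, separate from examples/05_loops.cpp. The function names are hypothetical. To stay within C++17, this sketch uses the iterator-based std::for_each, so the lambda is the third argument here rather than the second, and it leaves out std::views::iota (which requires C++20).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Range-based for loop: range declaration on the left of the colon,
// range expression on the right.
int sum_range_based(const std::vector<int>& v) {
    int total = 0;
    for (const auto& x : v)
        total += x;
    return total;
}

// C-style loop: the index idx appears three times in the header.
int sum_c_style(const std::vector<int>& v) {
    int total = 0;
    for (std::size_t idx = 0; idx < v.size(); idx++)
        total += v.at(idx);
    return total;
}

// for_each with a lambda that modifies each element through a reference.
void double_all(std::vector<int>& v) {
    std::for_each(v.begin(), v.end(), [](int& x) { x *= 2; });
}
```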
C++ practice
Monotonicity
Our goal is to write a function that accepts a
std::vector
of ints, and returns true if the sequence is
strictly increasing, i.e. each number is bigger than the one
before it in the sequence. Look at the program in file
examples/06_monotonicity.py
and translate it to C++. Things
to keep in mind:
- How to pass the input? By value, by reference, or const reference?
- Which for loop-style is suitable here?
- In C++, true and false are in lower case.
- You can use words like not, and, & or like in Python. But it is more common to use the corresponding symbolic operators, which are !, &&, & ||, respectively.
Luhn algorithm
The last digit of debit/credit card numbers, as well as OHIP number, SIN and other identifiers, is actually a check digit. Its purpose is to distinguish a valid number from mistyped or otherwise incorrect numbers. See Wikipedia for more details. The check digit is calculated from the other digits using a simple algorithm.
The Luhn algorithm: starting from the right side (excluding the check digit) multiply every other digit by 2. If the result of the multiplication is 10 or bigger, sum the two digits of the result (equivalently, subtract 9). Then, sum all results as sum. Finally, the check digit is (10 - (sum % 10)) % 10.
Example: 2445394258811369
We drop the last digit and start from the right (6) multiplying every other digit by 2 and following the other steps to get the answer:
 2  4  4  5  3  9  4  2  5  8  8  1  1  3  6
 ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓
 4  4  8  5  6  9  8  2 10  8 16  1  2  3 12
 ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓
 4  4  8  5  6  9  8  2  1  8  7  1  2  3  3
sum = 71
check digit = (10 - (71 % 10)) % 10 = (10 - 1) % 10 = 9 % 10 = 9
So in this example, the number is valid as we retrieved the expected
check digit. Look at the program in file
examples/07_luhn.py
and translate it to C++. Things to keep
in mind:
- Are there any variables that should be marked as const or constexpr?
- An easy way to access the last element of a vector is with the .back() method.
- You cannot use negative indices to access from the end like in Python.
In the solution file examples/07_luhn.cpp
we used
range adaptors to reverse the vector and drop the first
element, then the view could be used in a range-based for loop. It’s
perfectly acceptable to do the loop “C-style” or using iota,
but it may be less elegant (you can for example start from the left, but
pay attention to whether to start multiplying at the first or second
digit from the left).
There is also a loop-free implementation example
(examples/07_luhn_alternative.cpp
) that makes heavy use of
range adaptors.
More Standard Library
If we got so far, congratulations! That was the basic idea of how C++
works compared to Python. There is a lot more to learn,
in particular about the standard library. We already saw standard
library objects (cout
), containers (vector
),
views (iota
), and algorithms (for_each
),
that’s the tip of the iceberg.
Word frequencies
The goal is to count the number of times each word appears in a text
file (The Complete Works of William Shakespeare). The Python
version† can be found in
examples/08_word_frequencies.py
. Let us take some time to
understand it first before looking at the C++ version.
In this example we see:
- Reading a text file into a string.
- Using the for_each algorithm to modify a string in-place.
- The using keyword to create type aliases.
- The unordered_map container.
- Reading from a (string) stream with the >> operator.
- The condition of the while loop also performs an operation that returns true or false (think := in Python).
- The standard pair class.
- Constructing a std::vector of pairs from a std::unordered_map.
- The sort algorithm.
- The take view (a range adaptor).
- The | operator in ranges.
- Structured binding.
Streams
The C++ standard library uses streams for text-oriented I/O.
We already met cout
which is a character stream (a global
object), there are also file, string, and span streams that we can
create as needed. The idea behind streams is to provide a common
serial†† interface to format the data, regardless of
what device is used to communicate them.
This is a bit boring, in real life you may not need to rely on streams so much, but it’s useful here when working with text.
In the example, the function read_text
creates an input
file stream f. The file is implicitly opened when the object
f is created, and closed automatically when it goes out of
scope. In this case we read all of it into a buffer using the
read
method. The function count_appearances
creates a string stream from a regular string, so we could read it
word-by-word (whitespace delimited) using the >>
operator. We could also have read the file stream in the same way, but
chose to get the whole text as a string so it could be processed
in-place first (punctuation removed, case lowered).
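The word-by-word reading described above can be sketched in isolation. This is not the code from the example file; the function name split_words is hypothetical, and the sketch collects the words in a vector instead of counting them.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text into whitespace-delimited words.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream stream(text);   // a string stream that reads from text
    std::vector<std::string> words;
    std::string word;
    // operator>> skips whitespace (spaces, tabs, newlines); the condition
    // becomes false once the stream has no more words to deliver.
    while (stream >> word)
        words.push_back(word);
    return words;
}
```

Note how the loop condition both performs the read and reports whether it succeeded.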
Alternative implementations
(1) There is also a range factory for streams (the
istream
view), we can use it to replace the while loop by a
for_each
call, if we really wanted to. Is it more readable
though?
std::ranges::for_each(
    std::views::istream<std::string>(buffer),
    [&](const auto& word){
        word_counts[word]++;
    }
);
(2) In the Python version, we used the split method of the str class;
in C++ there is a range adaptor std::views::split that we can use with
the text string directly (no need for the stringstream buffer). Notice
though that elements of this view are subranges rather than normal
strings, so the loop could look like this:
for (const auto& word : std::views::split(text, ' '))
    word_counts[std::string(begin(word), end(word))]++;
This will not give us the same result unless additional text processing is done: the range adaptor splits only on the given delimiter, not on whitespace in general (such as newline characters), and consecutive delimiters produce empty words.
† There are many ways to do the same thing in both Python and C++.
Python has the collections.Counter class, which is actually better
suited to this task.
†† As opposed to random access.
Iris data set
The data set has measurements of 150 individual iris flowers of three species. The values in each row are separated by spaces: the first column is the species name, followed by four numeric quantities (the sepal length, sepal width, petal length, and petal width in centimetres, although for our purposes it doesn’t matter what they are).
The goal is to calculate the averages of the four numeric quantities
for each species separately. Preferably, we should do it without
knowing the number of species or the number of rows in advance, and the
number of numeric columns should be an adjustable parameter. The Python
version can be found in exercises/01_iris_data.py.
If this is too difficult, you can try an “easy” version of the
problem first: calculate the average of just one quantity (e.g. the
first) for each species separately. The Python version of that can be
found in exercises/01_iris_data_easy.py.
Tips
- This time we don’t need a string stream, we can read directly from the file stream.
- The sums map has string as its key type.
  - In the “easy” version of the problem, the value type is float.
  - In the full version of the problem, the value type is a vector<float> or an array<float, n>.
- You can use a while loop to go over the rows like in the example. In the easy version, you can read a full row like so:
  std::string species;
  float datum, _;
  while (f >> species >> datum >> _ >> _ >> _) {
      /* ... */
  }
  The values we are not interested in will be read into the _ variable, which we’ll just ignore.
  In the full version, first read the species name and then use an inner for loop to go over the numeric columns. The stream extraction in this case shouldn’t be in the condition of the while loop; instead, you can check if the stream has reached its end using the eof method, like so:
  while (!f.eof()) {
      f >> species;
      /* extract data using a for loop */
      /* ... */
  }
  Note that eof only becomes true once a read has actually hit the end of the file, so if the file ends with a trailing newline, the last f >> species fails. Check the stream state after that extraction and leave the loop if it failed.
- No need to sort this time, just loop over the map the same way we looped over word_counts_sorted in the word frequencies example. Even though that was a vector of pairs rather than a map, the for loop is the same.
  - The result may be printed in a different order than the Python solution; that is fine. Python dictionaries retain insertion order (since Python 3.7), C++ standard unordered maps do not.
The output should be something like:
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.77 4.26 1.326
virginica 6.588 2.974 5.552 2.026
Next steps
🏁 There’s still a lot to learn. Some of the topics that may be of interest for scientific programmers include:
- Multi-file projects and build tools
- Classes (object oriented programming)
- PyBind11
- More standard libraries
- Move semantics
- Multi-threaded applications
  - Standard thread library
  - OpenMP is an alternative
- Large-scale parallelism with MPI
- Manual memory management and pointers
  - You will need them to interact with C libraries such as GSL and MPI
  - Some C++ APIs, like HDF5, are very old-fashioned and involve pointers to some extent
- C++ Core Guidelines
- External tools