Introduction

Workshop goals

  • This workshop is meant for people who are already reasonably comfortable with programming in Python (but not necessarily experts).
  • “I want to change my code from Python to C++ so it runs faster”
    • There are excellent solutions to accelerate performance in Python (Numba, Cython, PyPy, Mojo?).
    • It may be easier to incorporate one of these than to learn a whole new language (but it may require a high level of expertise).
    • Naïve C++ code is generally faster though.
    • We’re not going to talk about profiling and optimization.
  • We’re going to learn state-of-the-art C++. In the “wild” things could be different:
    • Joining an existing project, you may find yourself constrained by what features you may use.
    • E.g. you may have to use a compiler that doesn’t implement the features you want.
  • Other than the C++ language there is a rich ecosystem that you should eventually learn about:
    • compilers
    • static & dynamic analysis tools
    • debugging, profiling, unit testing
    • external libraries
    • build systems
    • package managers
    • editors and IDEs
  • Setting up a local C++ work environment can be different for everybody.
    • You can look up online how to set it up on your particular computer.
    • For uniformity, we’ll use the SciNet Teach cluster (or Niagara).
    • You can edit files on the cluster using vim or nano.
    • Or you can edit files locally on your machine, and copy them to the cluster.

History of C++

  • 1980 to 1985: “C with Classes”
    • Development started by Bjarne Stroustrup in the early 1980s.
    • The goal was to add useful features (primarily object-oriented) to the C language.
    • Influenced by Simula, a language from the 1960s.
    • Python is actually not much younger than C++, first introduced in 1991.
  • 1985 to 1998: “Pre-standard C++”
    • Starting with the commercial release of the language in 1985.
    • It took 4–5 years for early C++ to stabilize.
    • Then about another decade to really figure out what the language was good for and how to use it and support it.
    • Around 1994 we see the introduction of the Standard Template Library (also known as the STL).
    • The STL (now part of the standard library) had quite a big impact on the direction of the language, and Stroustrup credits it with actually saving the language.
  • 1998 to 2011: “Old standard C++”
    • From the late 1990s there was not much new in terms of features, but the language was finally standardized, and work started on what would become C++11.
  • Since 2011: “Modern C++”
    • Since 2011 we are again in an era of relatively rapid change, in which C++ is being refined and improved, partly in response to changes in hardware. For example, threading and concurrency are now part of the language.

The C++ standards are roughly equivalent to Python versions. Features are added in new standards (and, rarely, obscure features are removed). The difference from Python is that Python has a reference implementation (CPython), which is the most popular one and whose version number corresponds to the version of the language. In the C++ world, compiler versions don’t correspond to language standards. Compiler vendors pick and choose which features to implement in which version, so compilers are usually somewhat behind the latest standard.

The standard consists of the core language and the standard library. “Core language” refers to the basic syntax, but the standard also describes a set of libraries that are an integral part of C++.

C++ is a (mostly) backward compatible language. That means there is a lot of historical baggage (or “technical debt”) in the language, i.e. concepts and idioms that are obsolete but still valid. You will encounter them in the wild, but be careful not to adopt them into your coding style.

The current standard of the language is called C++23.

A few words about C

  • C is a different language to C++.
    • C knowledge is not needed to learn C++.
    • C and C++ are completely different in terms of philosophy.
  • C code is (almost always) valid C++ code.
    • C code is (almost always) bad C++ code.
  • C has its place, but it is almost certainly not what you’re looking for.
    • C doesn’t have a lot of abstractions that are useful in scientific programming.
    • C is more difficult than C++ to code in (opinion).
  • C is the preferred language for many libraries (or their APIs), because
    • C has a very stable ABI.
    • C APIs can be accessed from C++ and most other languages.
  • C# is also something different.

“If you think like a computer, writing C actually makes sense.”

– Linus Torvalds

Is C++ the best language?

No. Because there is no such thing. There are several aspects to consider:

  • Does it have an easy and elegant syntax (short programs)?
  • Is it easy to learn?
  • Does it produce “safe” programs, or are they error prone?
  • Does it produce fast programs?
  • Does it have a large ecosystem?
  • Does it have a large and helpful community?

C++ is probably not the best with respect to the first two points, not the best in terms of the third point but much better than people think, and very good in terms of the last three points.

A quick comparison with Python

Similarities

  • Both are imperative programming languages, meaning that the program flows from one statement to the next.
  • Similar constructs:
    • variables: case sensitive, variable names follow the same rules: can’t start with a number, can’t have symbols like plus or minus…
    • control flow statements: if, for, while… The syntax is only slightly different, and both languages even use the same keywords, continue and break, to skip an iteration or exit a loop early.
    • types: in both languages variables have types, and we can create our own types. That is, both strongly support object-oriented programming.
    • functions (free, including lambdas, or methods in a class).
    • zero-based indexing.

Differences

Syntactic differences

(relatively minor and not so important)

  • Statements in C++ end in a semicolon (newline is meaningless).
  • Blocks are delimited by curly braces.
  • Parentheses are needed after if, for, and while.
  • Comments start with // and end with a new line, or are in a block like this: /* ... */.

† For readability, C++ code should be indented the same way as Python, it’s just not enforced by the language.

Implementation

(usually) C++ is a compiled language, while Python is interpreted. This is a bit of an oversimplification, but the bottom line is:

  • In Python, the program is one or more source files. To execute the program, they are read by a program called an interpreter which issues the instructions to the CPU.
  • In C++ the source files are compiled (“ahead of time”) to form an executable file (or library), that has the instructions to the CPU in machine language.

The consequence of this is that the compiler can heavily optimize the machine instructions that result from C++ code, which are then run on “bare metal” (kinda) rather than through the interpreter, which is generally (much) faster.

That being said, Python code is rarely pure Python. Packages like NumPy and SciPy are actually written in compiled languages but have Python interfaces, so code that heavily uses such packages may not gain much speedup by being rewritten in C++. Accelerating Python code is an art in itself, and transitioning to C++ is possibly just one aspect of that.

Another consequence is that erroneous coding in C++ can lead to compile-time errors. It’s often very useful to catch errors at this stage, rather than at runtime. The compiler can also detect some issues that may cause runtime errors, and issue warnings about them.

Type system

C++ is more “strict” with regard to the types of variables. Unlike in Python, a variable is associated with just one type†, and a function’s parameter types and return type are also determined in advance††.

† Cf. std::variant and std::any for the very uncommon scenarios where a single variable may need to hold values of different types at different times.

†† With templates in C++ we can write a function without really specifying types. That is actually very common, and among other things helps prevent code duplication, but we won’t touch on that in this workshop.

C++ theory

A “Hello, World!” program

Setting up environment on the Teach cluster

  1. Use an SSH client program to connect to the Teach cluster (teach.scinet.utoronto.ca) using your credentials.

  2. Type the following commands to get the course material:

    cd $SCRATCH
    tar xf /scinet/course/scmp241/py2cpp.tar.bz2
    cd py2cpp

    👉 The source file can also be downloaded from here: https://scinet.courses/mod/resource/view.php?id=2948

  3. Activate the environment:

    source activate
  4. Optionally split your screen by typing tmux, then press Ctrl+B followed by " (quotation mark).

    • You can toggle between the panes using Ctrl+B followed by ; (semicolon).
  5. Create a file called hello.cpp in the current folder, and type this content:

    #include <print>
    
    int main()
    {
        std::println("Hello, World!");
    }
  6. Switch to the other pane (or exit the editor) and build the program with: g++ hello.cpp.

  7. Now run the program with: ./a.out.

👉 We can choose a different name for the output instead of a.out, for example if we want the executable file to be called hello, the command would be g++ -o hello hello.cpp.

Notes about the code

  • Include directive. You can think of it like import in Python, in this case we “imported” std::println.
    • In practice it is quite different, we actually included a header file.
    • Header files contain (mostly) declarations of the functions (etc.) available in some library.
    • Declaration-implementation separation is important in multi-file projects.
    • Header files are gradually being superseded by modules (which use the import keyword).
  • Main function (we’ll talk more about it later).
  • Note curly braces and semicolon.
  • The :: symbol just means that there is a function called println in a namespace called std. For now just think about std::println as the name of a function.
    • Standard library functions, classes, and objects exist in this std namespace.
  • String constructed with double quotes only (single quotes used for a single character).

About printing

  • std::println is a brand new part of the C++ standard, and not widely available in compilers as of 2023.

    • std::println always prints a new line at the end, use std::print if that is not desired.
  • A more traditional hello-world program looks like this:

    #include <iostream>
    
    int main()
    {
        std::cout << "Hello, World!\n";
    }
  • The << operator is used to “insert” the string to the cout stream; in other words just print it. We’ll talk more about streams at the end.

  • Here, we have to specify newline character (\n) explicitly if we need it.

📖 Extra information

This is a more detailed explanation of what is going on when running the g++ command. It is not critical to understand at this point.

  • g++ is the invocation command for the GNU Compiler Collection (also known as GCC).
    • There are many other compilers. GCC is reliable and commonly used.
  • What g++ does is quite complex, technically “compilation” is just one of several steps.
  • In a project with multiple .cpp files, the files are compiled separately and linked together into one executable at the end.
    • The .cpp files are called compilation units.
  • There are many parameters we may need to pass to g++, including where to find external libraries and, importantly, optimization flags.
  • Most projects use some build automation software to manage the build process.
    • Even for small projects it’s recommended to create a Makefile, a script used by the GNU Make tool.
    • When the build is complex (especially when there are external dependencies), CMake can be used to create the Makefile.
    • That’s outside the scope of the workshop, but very important in the real world.
Build model

About portability

  • A Python program will run everywhere where an interpreter and required packages are installed.
  • C++ source code will compile and run everywhere where a compiler and required libraries are installed.
  • The C++ executable though will not necessarily run on systems other than the one it was built for.
    • It contains machine language instructions that are specific to (usually) one CPU architecture.
    • The executable format and syscalls are operating system-specific.
    • The executable is usually dynamically linked to some libraries, and their presence in the system running the application is required.

The bottom line is that the executable is not very portable.

Variables

Unlike in Python, variables have to first be declared (with a type) before a value is assigned. They can (and should) be simultaneously initialized. For example (“copy initialization”):

int a = 5;

There are many other styles to do the same thing. Bjarne Stroustrup’s favourite way is “list initialization”:

int a {5};

It is almost equivalent to the first style for int†.

The auto keyword can be used instead of the type:

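auto a {5};  // list initialization
auto a = 5;  // copy initialization (an alternative to the line above, not an addition to it)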

In this context, this keyword means “the type is whatever the type is on the right-hand side”, and should be used when we don’t exactly know or care what is on the right-hand side.

The variable declared like so still has a type (int in this case), it’s just deduced by the compiler. Where and how often to use auto is a personal choice.

⚠️ Don’t declare variables too far from where they are needed in the code.

† The difference is that list initializing disallows narrowing conversion.
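
As a small illustration of the initialization styles above, here is a sketch that uses each style once. The function name is made up for this note and is not part of the course material. The narrowing line is commented out because list initialization refuses to compile it.

```cpp
#include <type_traits>

// Hypothetical helper: initializes ints in different styles and adds them up.
int sum_of_initialized_values()
{
    int a = 5;    // copy initialization
    int b {6};    // list initialization
    auto c = 7;   // the compiler deduces int from the literal 7
    auto d {8};   // also deduced as int

    // auto does not mean "no type": c really is an int.
    static_assert(std::is_same<decltype(c), int>::value, "c is an int");

    // int e {1.234};  // does not compile: narrowing conversion from double to int

    return a + b + c + d;   // 5 + 6 + 7 + 8
}
```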

Nature of variables in Python and C++

In short, Python variables are references to objects, C++ variables are the objects.

In Python, the statement x = 1234 creates an object somewhere in memory and the label x refers to that object. When we later modify the value (e.g. x = 5678 or x = 'hello'), another object is created somewhere else in memory, and the label x is reassigned to refer to that new object. The original object (holding the value 1234) is not necessarily immediately destroyed and its memory freed, but is garbage collected eventually.

In C++, the statement int x = 1234; (or its equivalents) creates an object somewhere in memory. In contrast with Python, modifying x actually modifies the value in the memory location where 1234 is stored; that is partly why you can’t use x for values of another type. The integer object is immediately destroyed and the memory freed when x goes out of scope.

You can also think about it as a different behaviour of the = operator in the two languages.

This difference will come up again when we talk about how to pass parameters to functions.
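
The "variables are the objects" idea can be checked with a short sketch. Both function names are invented for illustration. The second function uses the address-of operator & (pointers are discussed further down) only to make the memory location visible.

```cpp
// Hypothetical helper: copies x into y, changes y, and reports x.
int original_after_copy_is_modified()
{
    int x = 1234;
    int y = x;    // y is a separate object holding a copy of the value
    y = 5678;     // overwrites the memory of y only
    return x;     // x was never touched
}

// Hypothetical helper: checks that assignment reuses the same memory location.
bool assignment_keeps_address()
{
    int x = 1234;
    const int* address_before = &x;   // where the object lives
    x = 5678;                         // new value, same object
    return address_before == &x;
}
```

In Python the analogous experiment with id(x) before and after x = 5678 would typically show two different objects.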

Scoping rules

In Python, resolution of what x actually refers to in any specific context follows the “LEGB” rule, and can get quite confusing.

Scoping rules for C++ are very simple: a variable x is accessible after it is defined (it can only be defined once), in the same scope and all inner scopes. The name x can be reused for a different variable in an inner scope (that is called shadowing and is a bad practice).

For example, this is not legal:

{
    int a = 5;
    float a = 1.234;
}

But this is fine:

{
    int a = 5;
}
{
    float a = 1.234;
}
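
Shadowing, mentioned above as legal but bad practice, could look like the following sketch (the function name is hypothetical). Compilers such as GCC can warn about it when asked to, for example with -Wshadow.

```cpp
// Hypothetical helper: the inner a is a different variable from the outer a.
int outer_value_after_shadowing()
{
    int a = 5;
    {
        int a = 7;   // legal, but shadows the outer a
        a += 1;      // modifies the inner a only
    }
    return a;        // the outer a is untouched
}
```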

Types

Similar to Python, we have

  • “Primitive” types: int, char, bool, float, & double (sometimes called fundamental types).
    • int is an integer type. Unlike in Python, it has a fixed range.
      • The range is actually platform-dependent, but almost always ranges between approx. –2 and +2 billion (4 bytes or 32 bits).
      • The range can be adjusted by using the unsigned / signed and/or short / long type modifier keywords†.
    • char is a 1-byte numeric type, usually with the range -128 to +127 (whether plain char is signed is platform-dependent).
      • It can be marked unsigned to range between 0 and 255.
      • It’s also used to store a single ASCII character.
    • float and double are 32- and 64-bit IEEE-754 floating point types (on virtually all current platforms).
      • Python’s float is usually 64 bit.
      • Many systems support long double. It may be 128 bits wide, but that’s implementation-dependent.
  • User-defined types (classes, including standard library classes).
    • We’ll see standard library containers later which are important classes.

👉 Like in Python, classes are types, and the terms may be used interchangeably.

† To specify exactly what width of integer is needed, one can use fixed width integer types.
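
A minimal sketch of the fixed width integer types and of querying the range of a type. It assumes a mainstream platform with 8-bit bytes, and the helper name is invented for this note.

```cpp
#include <cstdint>
#include <limits>

// Checked by the compiler: these sizes hold on mainstream platforms.
static_assert(sizeof(char) == 1, "char is always 1 byte");
static_assert(sizeof(std::int32_t) == 4, "int32_t is 4 bytes");
static_assert(sizeof(std::int64_t) == 8, "int64_t is 8 bytes");

// Hypothetical helper: would this value fit into a 32-bit signed integer?
bool fits_in_int32(std::int64_t value)
{
    return value >= std::numeric_limits<std::int32_t>::min()
        && value <= std::numeric_limits<std::int32_t>::max();
}
```

std::numeric_limits works for the other fundamental types too, e.g. std::numeric_limits<int>::max() gives the largest int on the current platform.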

Type casting

⚠️ the following is legal but doesn’t work the same way as in Python:

{
    int a = 5;
    a = 1.234;
}

The value of a at the end of the block is 1! Since it can only hold an integer, the value 1.234 is truncated (a narrowing conversion). By default GCC won’t complain if we do that, but we can pass the -Wconversion flag to enable warnings. If we use list initialization (a = {1.234};) we’ll get an error.

What happens if we try to assign a string literal to int a? Casting is not possible in this case.

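As a side note, conversion from int to float is also narrowing, because integers larger in magnitude than 2^24 (16,777,216) may not be represented exactly by the 32-bit float type. Conversion from a 32-bit int to double, however, loses no information (although list initialization still treats it as narrowing unless the value is a compile-time constant).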

The “safe” way to cast from one type to another is using static_cast. When the conversion is narrowing, it is a way to promise the compiler that we know what we’re doing, so that it doesn’t raise errors or warnings. Explicit conversion is useful, for example, when we divide one integer by another: the result is an integer division, like in old Python 2 (or the // operator in Python 3), except that C++ truncates toward zero while Python rounds down. To get around that, one of the numbers has to be cast (e.g. to a double).

auto a {3};
auto b {2};
auto c = a/b; // c is an integer with the value 1
auto d = static_cast<double>(a)/b; // d is a double with the value 1.5
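
The side note about int to float conversion can be checked with a short sketch. It assumes a 32-bit int and an IEEE-754 32-bit float, and the function name is made up for illustration. The number 16777216 is 2 to the power 24.

```cpp
// Hypothetical helper: does an int keep its exact value when stored in a float?
bool survives_float_round_trip(int value)
{
    const auto as_float = static_cast<float>(value);   // explicit, possibly lossy
    // A double holds every 32-bit int exactly, so the comparison is done in double.
    return static_cast<double>(as_float) == static_cast<double>(value);
}
```

With these assumptions, 16777216 survives the round trip but 16777217 does not, because the float type has no bit left to store the final 1.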

const and constexpr

Variables of any type in C++ can be made read-only with the const keyword, and some types can also be constexpr.

constexpr is “stronger” because it implies that the value is not only unchangeable, but also known at compile time. It can still be a calculated value, but the result of the calculation has to be known at compile time.

Whenever we create a variable that is not expected to change, it’s strongly recommended to mark it constexpr or const (if not calculable at compile time).

This is important to help the compiler optimize the code, and may also prevent mistakes.
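
A minimal sketch of the difference between the two keywords, with invented names. The first two values can be calculated by the compiler, while the value inside the function depends on a run-time argument.

```cpp
constexpr int seconds_per_minute = 60;
constexpr int seconds_per_hour = 60 * seconds_per_minute;   // calculated at compile time
static_assert(seconds_per_hour == 3600, "verified by the compiler");

// Hypothetical helper: converts hours to seconds.
int hours_to_seconds(int hours)
{
    // Depends on a run-time argument, so const is possible but constexpr is not.
    const int total = hours * seconds_per_hour;
    // total = 0;   // does not compile: total is read-only
    return total;
}
```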

Functions

See example: examples/02_functions.cpp

  • A function has a signature: one return type and zero or more input types.
    • In Python it is also a good idea to declare input and output types!
  • In C++ it is perfectly fine to have functions with the same name but different signatures. That is called function overloading.
  • The return type can be “void” for a function that doesn’t actually return anything.
  • Default parameters (like in Python, have to come at the end).
  • If we want a function to return more than one number (or any other object), we can make the return type a tuple.
    • Notice the if-statement in the tuple example: if the body of a true/false clause or a loop has only one statement, the curly braces are optional.
  • Input parameters are passed by value (copy) by default, which can be very bad. More later about how to pass by reference.
  • Functions can be “anonymous” and may be used in place or assigned to variables, these are lambdas. We’ll see them later.

⚠️ If a function grows too big, has too many nesting levels, or too many parameters, it should probably be split into multiple smaller functions.
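
Some of the points above, sketched in code. This is not the content of examples/02_functions.cpp. All function names here are made up for this note.

```cpp
#include <tuple>

// Two functions with the same name but different signatures (overloading).
int area(int side) { return side * side; }
int area(int width, int height) { return width * height; }

// A default parameter: scale(3.0) means the same as scale(3.0, 2.0).
double scale(double value, double factor = 2.0) { return value * factor; }

// Returning two numbers at once as a tuple: (quotient, remainder).
std::tuple<int, int> divide_with_remainder(int numerator, int denominator)
{
    if (denominator == 0)
        return std::make_tuple(0, 0);   // single statement, so the braces are optional
    return std::make_tuple(numerator / denominator, numerator % denominator);
}
```

The elements of the returned tuple can be read with std::get<0>(result) and std::get<1>(result), or unpacked with a structured binding such as auto [quotient, remainder] = divide_with_remainder(7, 2);.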

The main() function

  • As we’ve seen, this is the entry point to the program.
  • If we create an executable, we have to have main, otherwise (library) we shouldn’t have a function by that name.
  • The return type should always be int, and upon successful termination main should return zero (and an error code otherwise).
  • Unlike all other non-void functions, it’s OK to not have a return statement, in which case the exit code is zero.
  • main can have arguments, which is how command line parameters can be passed to the program.

Templates

This is an important and big topic in C++ but we won’t touch on it here beyond these notes. This idea doesn’t even exist in Python because Python programming is generic by design.

  • We can make a function generic by using auto as a placeholder for the type in the declaration.
  • This makes this function a template.
  • Templates are compiled as needed, as opposed to normal functions that are always compiled.
    • Meaning when it is used in the code (instantiated), the compiler sees what the parameter types actually are and creates a specialization of the template.

To use (or instantiate) a template, if the types cannot be automatically deduced, they have to be specified in angle brackets <> as we will see in the examples.

📖 Extra information

The idea of templates goes well beyond a placeholder type.

  • Classes can also be templates.
  • Using auto like so is the abbreviated function template syntax.
  • There is also a “full” syntax for template declaration.
  • Templates should be used with concepts to reduce errors and increase readability.
  • Template meta-programming can get really complicated.
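
For reference, a minimal sketch of a function template written in the "full" syntax mentioned above, which also compiles with pre-C++20 compilers. The function name is invented for illustration.

```cpp
// T is a placeholder for a type that is chosen where the function is used.
template <typename T>
T larger_of(T a, T b)
{
    return (a < b) ? b : a;
}
```

Calling larger_of(2, 3) lets the compiler deduce T as int. A call like larger_of(1, 2.5) does not compile, because T cannot be both int and double; writing larger_of<double>(1, 2.5) specifies the type in angle brackets and resolves the ambiguity.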

Standard containers

The C++ standard library provides some useful containers.

std::vector

This is the most important container, similar to a Python list, but all elements have the same type (so more similar to a NumPy 1d array). The type has to be indicated as a template parameter if it can’t be deduced. This container can be used as a stack (i.e. push and pop), and has random access.

See example: examples/03_vector.cpp

  • To use a vector we must #include the <vector> header.
  • The std::vector class has multiple constructors.
    • Template arguments are sometimes needed to specify the value type.
  • The class has many useful methods: size, push_back, empty, and at.
  • Elements are accessed with the at method.

For a full list of constructors and other methods, see here.

⚠️ Accessing elements with square brackets [] like in Python is possible but not recommended because there are no bounds checks.
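
A short sketch of the methods listed above. It is not the content of examples/03_vector.cpp. The helper names are made up, and the second one relies on at throwing std::out_of_range for an invalid index.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical helper: builds {0, 1, 4, 9, ...} with count elements.
std::vector<int> make_squares(int count)
{
    std::vector<int> squares;        // the value type is given explicitly
    for (int i = 0; i < count; i++)
        squares.push_back(i * i);    // grow the vector at the end
    return squares;
}

// Hypothetical helper: shows that at() refuses an out-of-range index.
bool at_rejects_bad_index(const std::vector<int>& values)
{
    try {
        static_cast<void>(values.at(values.size()));   // one past the last element
    } catch (const std::out_of_range&) {
        return true;    // at() threw instead of reading invalid memory
    }
    return false;
}
```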

Other containers

  • std::array is similar to std::vector but with a fixed size that is known at compile-time.
  • std::unordered_map is the equivalent of a Python dictionary.
    • There is also std::map, but the unordered version is usually what you want (performance differs).
  • std::unordered_set is the equivalent of a Python set.
  • std::valarray is a bit old-fashioned but not deprecated. It is similar to a NumPy array in that it supports element-wise mathematical operations, slicing, and reductions. You can do all of these with std::vector, but you need to define the operations manually. Either way, this is not as powerful as using a linear algebra library like Eigen or Armadillo.
  • There are many other containers, but the above cover almost all use cases.
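
To make the comparison with Python's set and dict concrete, here is a minimal sketch with two invented helper functions.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Roughly len(set(values)) in Python.
std::size_t count_distinct(const std::vector<int>& values)
{
    const std::unordered_set<int> unique(values.begin(), values.end());
    return unique.size();
}

// Roughly a Python dict mapping each word to the number of times it occurs.
std::unordered_map<std::string, int> count_words(const std::vector<std::string>& words)
{
    std::unordered_map<std::string, int> counts;
    for (const auto& word : words)
        counts[word]++;   // a missing key is first created with the value 0
    return counts;
}
```

Note that, unlike a Python dict, reading a missing key with square brackets inserts it. The at method throws an exception instead.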

References (and pointers)

See example: examples/04_references.cpp

  • References can be used to create an alias variable.
  • The real power is that if a function has a reference type in its signature, the variable is passed to it by reference, so copying is avoided.
  • The function my_func_by_value actually doesn’t do anything.
  • The function my_func_by_reference successfully mutates a in place.
    • In this case, the parameter is called an in/out parameter.
    • Mutating an in/out parameter is called a “side effect” of the function.
      • Functions without side effects are called “pure”.
      • Pure functions are easier for the compiler to optimize and for humans to understand.
    • If all the function needs to do is to mutate an int or something like that, it is better to keep it a pure function by just passing by value and returning the result.
      • If the goal is to mutate a few ints, prefer to return a tuple or a struct.

Parameters of “small” types (e.g. int, even double) can and should be passed by value, but anything bigger (e.g. standard containers) should be passed by reference, and very often by const reference if it need not be modified.
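
The three ways of passing a parameter could be sketched as follows. These functions are written for this note with invented names. They are in the spirit of, but not copied from, examples/04_references.cpp.

```cpp
#include <vector>

// Receives a copy: the caller's variable is not affected.
void double_by_value(int value) { value *= 2; }

// Receives a reference: the caller's variable is modified in place.
void double_in_place(int& value) { value *= 2; }

// Pure alternative for a small type: pass by value and return the result.
int doubled(int value) { return 2 * value; }

// A container that is only read is passed by const reference, so it is not copied.
int sum_of(const std::vector<int>& values)
{
    int total = 0;
    for (const auto& value : values)
        total += value;
    return total;
}
```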

Passing by reference in Python 🐍

In Python there is not much choice, everything is passed by reference. But remember that the assignment operator = reassigns the label. So if we have a function like:

def my_func_assignment(param):
    param = param * 2

It will not change the object that the caller passed in. It will create a new object with the value param * 2 and assign it to the local variable param.

There is a subtle difference between param = param * 2 and param *= 2 though. For int and primitive types like that, it’s the same. But for classes, the augmented assignment operators (such as +=, *=, …) can be overloaded in such a way that they mutate the object. See examples/04_references.py for an example where an input parameter can be mutated if it is a class object.

Similarly, calling methods on an input variable of some class type may mutate the variable.

About pointers

  • Pointers are variables that hold memory addresses.
    • Of other variables, or elements in a container.
    • May point to manually managed memory.
  • They are very rarely needed in modern C++ proper, because:
    • References provide a very similar functionality.
    • With standard containers there is usually no need to manually manage memory.
  • Improper use of pointers leads to memory bugs like leaks and access violations (segmentation fault).
    • References are almost always safe (the main exception being a dangling reference, e.g. when a function returns a reference to a variable created in its scope).
  • Pointers are mostly useful when interacting with or wrapping a C library (since C has no reference types).
    • In C, using pointers is necessary in complex programs.
  • In situations where pointers are really needed, you should use smart pointers.
  • In Python you can get the memory address of a variable with the id builtin, but that’s about as far as Python goes in supporting pointers.
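
For completeness, a minimal sketch of a raw pointer and a smart pointer. The function names are invented, and std::unique_ptr is only one of the smart pointer types in the standard library.

```cpp
#include <memory>

// A raw pointer holding the address of another variable.
int increment_through_pointer()
{
    int value = 41;
    int* pointer = &value;   // & takes the address of value
    *pointer += 1;           // * accesses the object the pointer points to
    return value;
}

// A smart pointer owns its memory and frees it automatically.
int use_smart_pointer()
{
    auto number = std::make_unique<int>(5);   // no manual new or delete
    *number *= 2;
    return *number;
}   // the int is released here, when number goes out of scope
```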

Loops

See example: examples/05_loops.cpp

  • Range-based for loops are the “workhorse” of C++.
    • Inside the parentheses, on the left is the range declaration. It’s similar to a variable declaration.
    • On the right after the colon is the range expression. In the first example it is just the container. In the second, it is the container modified by a range adaptor.
  • C-style loops just increment an index variable until some condition is met. The index appears three times in the loop’s header and it’s surprisingly easy to mess it up.
    • Notice the increment operator ++, that is where the language gets its name!
    • The meaning of idx++; is exactly the same as idx += 1;, which is also valid in C++.
    • It doesn’t have to be idx++ on the right, you can decrement with the -- operator, or use a custom stride.
  • std::views::iota is similar to range in Python, it is lazily evaluated (a “range factory” in C++ terminology).
    • Both arguments are needed to create a finite range!! std::views::iota(5) is an infinite series starting with 5.
    • iota is quite limited, can only go forward in increments of one.
    • The index can be declared const.
  • for_each is an algorithm from the standard library that executes some function for each element in the container.
    • It may or may not modify the element.
    • We used a lambda function as the second argument.
    • There are many more algorithms! We’ll see some of them later.
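
Three of the loop styles above, sketched with invented function names. This is not the content of examples/05_loops.cpp. To stay compatible with pre-C++20 compilers the sketch leaves out std::views::iota and uses the iterator-pair form of std::for_each, which may differ from what the course example uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Range-based for loop: range declaration on the left, range expression on the right.
int sum_range_based(const std::vector<int>& values)
{
    int total = 0;
    for (const auto& value : values)
        total += value;
    return total;
}

// C-style loop: the index idx appears three times in the header.
int sum_c_style(const std::vector<int>& values)
{
    int total = 0;
    for (std::size_t idx = 0; idx < values.size(); idx++)
        total += values.at(idx);
    return total;
}

// for_each with a lambda that modifies each element in place.
void double_all(std::vector<int>& values)
{
    std::for_each(values.begin(), values.end(), [](int& value) { value *= 2; });
}
```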

C++ practice

Monotonicity

Our goal is to write a function that accepts a std::vector of ints, and returns true if the sequence is strictly increasing, i.e. each number is bigger than the one before it in the sequence. Look at the program in file examples/06_monotonicity.py and translate it to C++. Things to keep in mind:

  • How to pass the input? By value, by reference, or const reference?
  • Which for loop-style is suitable here?
  • In C++, true and false are in lower case.
  • You can use words like not, and, and or like in Python. But it is more common to use the corresponding symbolic operators, which are !, &&, and ||, respectively.
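
To compare against after attempting the exercise, here is one possible shape of the result. It is a sketch written for this note, not the course's solution. The function name and the choice of a C-style loop are assumptions.

```cpp
#include <cstddef>
#include <vector>

// The input is only read, so it is passed by const reference.
bool is_strictly_increasing(const std::vector<int>& sequence)
{
    // Start at the second element and compare each element with the previous one.
    for (std::size_t idx = 1; idx < sequence.size(); idx++) {
        if (!(sequence.at(idx) > sequence.at(idx - 1)))
            return false;
    }
    return true;   // also reached for empty and single-element sequences
}
```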

Luhn algorithm

The last digit of debit/credit card numbers, as well as OHIP numbers, SINs, and other identifiers, is actually a check digit. Its purpose is to distinguish a valid number from mistyped or otherwise incorrect numbers. See Wikipedia for more details. The check digit is calculated from the other digits using a simple algorithm.

The Luhn algorithm: starting from the right side (excluding the check digit), multiply every other digit by 2. If the result of the multiplication is 10 or bigger, sum the two digits of the result (equivalently, subtract 9). Then add up all the results (doubled and untouched digits alike) to get sum. Finally, the check digit is (10 - (sum % 10)) % 10.

Example: 2445394258811369

We drop the last digit and start from the right (6) multiplying every other digit by 2 and following the other steps to get the answer:

2  4  4  5  3  9  4  2  5  8  8  1  1  3  6
↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓
4  4  8  5  6  9  8  2  10 8  16 1  2  3  12
↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓
4  4  8  5  6  9  8  2  1  8  7  1  2  3  3

sum = 71
check digit = (10 - (71 % 10)) % 10 = (10 - 1) % 10 = 9 % 10 = 9

So in this example, the number is valid as we retrieved the expected check digit. Look at the program in file examples/07_luhn.py and translate it to C++. Things to keep in mind:

  • Are there any variables that should be marked as const or constexpr?
  • An easy way to access the last element of a vector is with the .back() method.
    • You cannot use negative indices to access from the end like in Python.

In the solution file examples/07_luhn.cpp we used range adaptors to reverse the vector and drop the first element, then the view could be used in a range-based for loop. It’s perfectly acceptable to do the loop “C-style” or using iota, but it may be less elegant (you can for example start from the left, but pay attention to whether to start multiplying at the first or second digit from the left).

There is also a loop-free implementation example (examples/07_luhn_alternative.cpp) that makes heavy use of range adaptors.
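
For comparison, a "C-style" variant could look like the sketch below. It is not one of the course's solution files. The function names are made up, and the digits are assumed to be stored one per element in a std::vector of ints.

```cpp
#include <cstddef>
#include <vector>

// Computes the check digit from the digits WITHOUT the check digit.
int luhn_check_digit(const std::vector<int>& payload)
{
    int sum = 0;
    bool double_it = true;   // the rightmost payload digit is doubled
    for (std::size_t idx = payload.size(); idx > 0; idx--) {
        int digit = payload.at(idx - 1);   // walk from right to left
        if (double_it) {
            digit *= 2;
            if (digit >= 10)
                digit -= 9;   // same as summing the two digits
        }
        sum += digit;
        double_it = !double_it;   // alternate between doubling and not doubling
    }
    return (10 - (sum % 10)) % 10;
}

// Checks a full number whose last digit is the check digit.
bool is_valid(const std::vector<int>& digits)
{
    if (digits.empty())
        return false;
    const std::vector<int> payload(digits.begin(), digits.end() - 1);
    return luhn_check_digit(payload) == digits.back();
}
```

For the worked example 2445394258811369 above, luhn_check_digit applied to the first 15 digits gives 9, so is_valid accepts the full number.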

More Standard Library

If you got this far, congratulations! That was the basic idea of how C++ works compared to Python. There is a lot more to learn, in particular about the standard library. We already saw standard library objects (cout), containers (vector), views (iota), and algorithms (for_each), but that’s just the tip of the iceberg.

Word frequencies

The goal is to count the number of times each word appears in a text file (The Complete Works of William Shakespeare). The Python version can be found in examples/08_word_frequencies.py. Let us take some time to understand it first before looking at the C++ version.

In this example we see:

  • Reading a text file into a string.
  • Using the for_each algorithm to modify a string in-place.
  • The using keyword to create type aliases.
  • The unordered_map container.
  • Reading from a (string) stream with the >> operator.
  • The condition of while loop also performs an operation that returns true or false (think := in Python)
  • The standard pair class.
  • Constructing a std::vector of pairs from a std::unordered_map.
  • The sort algorithm.
  • The take view (a range adaptor).
  • The | operator in ranges.
  • Structured binding.
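
A few of these items (the type alias, the vector of pairs built from an unordered_map, and the sort algorithm) can be sketched in isolation. This is not the code of the course example. The names WordCount and most_frequent are invented, and the sketch uses the classic iterator-based std::sort plus resize where the course example uses the take view.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, int>;   // a type alias

// Hypothetical helper: the how_many most frequent words, most frequent first.
std::vector<WordCount> most_frequent(const std::unordered_map<std::string, int>& counts,
                                     std::size_t how_many)
{
    // An unordered_map cannot be sorted, so copy its (key, value) pairs into a vector.
    std::vector<WordCount> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const WordCount& a, const WordCount& b) { return a.second > b.second; });
    if (sorted.size() > how_many)
        sorted.resize(how_many);   // keep only the first how_many entries
    return sorted;
}
```

With a structured binding, the result can be looped over as for (const auto& [word, count] : most_frequent(counts, 10)), which gives each half of the pair its own name.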

Streams

The C++ standard library uses streams for text-oriented I/O. We already met cout which is a character stream (a global object), there are also file, string, and span streams that we can create as needed. The idea behind streams is to provide a common serial†† interface to format the data, regardless of what device is used to communicate them.

This is a bit boring, in real life you may not need to rely on streams so much, but it’s useful here when working with text.

In the example, the function read_text creates an input file stream f. The file is implicitly opened when the object f is created, and closed automatically when it goes out of scope. In this case we read all of it into a buffer using the read method. The function count_appearances creates a string stream from a regular string, so we could read it word-by-word (whitespace delimited) using the >> operator. We could also have read the file stream in the same way, but chose to get the whole text as a string so it could be processed in-place first (punctuation removed, case lowered).
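
The word-by-word reading described above can be sketched on its own. The function below is hypothetical and not part of the course example. It only demonstrates a string stream together with the >> operator in a while condition.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text at any whitespace (spaces, tabs, newlines).
std::vector<std::string> split_on_whitespace(const std::string& text)
{
    std::istringstream buffer(text);   // a string stream that reads from text
    std::vector<std::string> words;
    std::string word;
    while (buffer >> word)             // false once no more words are available
        words.push_back(word);
    return words;
}
```

Replacing std::istringstream with a std::ifstream opened on a file name would read words directly from a file in the same way.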

Alternative implementations

(1) There is also a range factory for streams (the istream view), we can use it to replace the while loop by a for_each call, if we really wanted to. Is it more readable though?

std::ranges::for_each(
  std::views::istream<std::string>(buffer),
  [&](const auto& word){
    word_counts[word]++;
  }
);

(2) In the Python version, we used the split method of the str class; in C++ there is a range adaptor std::views::split that we can use with the text string directly (no need for the stringstream buffer). Notice though that elements of this view are subranges rather than normal strings, so the loop could look like this:

for (const auto& word : std::views::split(text, ' '))
    word_counts[std::string(begin(word), end(word))]++;

This will not give us the same result unless additional text processing is done: the range adaptor splits only on the given delimiter, not on whitespace in general (such as newline characters), and consecutive delimiters produce empty elements.

† There are many ways to do the same thing in both Python and C++. Python has the collections.Counter class that is actually better for this task.

†† As opposed to random access.

Iris data set

The data set contains measurements of 150 individual iris flowers of three species. The values in each row are separated by spaces: the first column is the species name, and it is followed by four numeric quantities (the sepal length, sepal width, petal length, and petal width in centimetres, although for our purposes it doesn’t matter what they are).

The goal is to calculate the averages of the four numeric quantities for each species separately. Preferably, we should do it without knowing the number of species or the number of rows in advance, and the number of numeric columns should be an adjustable parameter. The Python version can be found in exercises/01_iris_data.py.

If this is too difficult, you can try an “easy” version of the problem first: calculate the average of just one quantity (e.g. the first) for each species separately. The Python version of that can be found in exercises/01_iris_data_easy.py.

Tips

  • This time we don’t need a string stream, we can read directly from the file stream.
  • The sums map has key type of string.
    • In the “easy” version of the problem, the value type is float.
    • In the full version of the problem, the value type is a vector<float> or an array<float, n>.
  • You can use a while loop to go over the rows like in the example.
    • In the easy version, you can read a full row like so:

      std::string species;
      float datum, _;
      while (f >> species >> datum >> _ >> _ >> _) {
          /* ... */
      }

      The values we are not interested in will be read into the _ variable, which we’ll just ignore.

    • In the full version, read only the species name in the condition of the while loop, and then use an inner for loop to go over the numeric columns. Avoid using the eof method as the loop condition: the end-of-file flag is set only after an extraction has hit the end of the file, so if the file ends with a newline such a loop runs once more after the last row, with stale values. The loop can look like so:

      while (f >> species) {
          /* extract data using a for loop */
          /* ... */
      }
  • No need to sort this time, just loop over the map the same way we looped over word_counts_sorted in the word frequencies example. Even though that was a vector of pairs rather than a map, the for loop is the same.
    • The result may be printed in a different order than in the Python solution; that is fine. Python dictionaries retain insertion order (since Python 3.7), whereas C++ standard unordered maps do not.

The output should be something like:

setosa 5.006 3.428 1.462 0.246 
versicolor 5.936 2.77 4.26 1.326 
virginica 6.588 2.974 5.552 2.026

Next steps

🏁 There’s still a lot to learn. Some of the topics that may be of interest for scientific programmers include:

  • Multi-file projects and build tools
  • Classes (object oriented programming)
  • PyBind11
  • More of the standard library
  • Move semantics
  • Multi-threaded applications
    • Standard thread library
    • OpenMP is an alternative
  • Large-scale parallelism with MPI
  • Manual memory management and pointers
    • You will need them to interact with C libraries such as GSL and MPI
    • Some C++ APIs, like HDF5, are very old-fashioned and involve pointers to some extent
  • C++ Core Guidelines
  • External tools
Last modified: Thursday, 2 November 2023, 22:39