
Thursday, December 8, 2022

Using characters, malloc, memcpy and compiling in C in Visual Studio (aka weird stuff that can happen when adding strings)

 

(I am trying out something new today called Crayon Syntax Highlighter, which should hopefully make some of these coding examples look a bit prettier. I set it to the Visual Studio 2022 theme, which means using Consolas as the font. So let’s see how this goes!)

Today I will show you what happens when you do something totally logical but not quite in the correct way (and arguably not even in a way you should do it at all, but we will have to ignore that for now). First of all, some background: in an attempt to join together some strings of undetermined length, I created a function which did some very bizarre things. I will go through some background on using characters and lead on to what happened. As a really basic example, we can just make a new C++ console project in Visual Studio and do the following in the main .cpp file:

Characters

#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
	firstname = "Arseniy ";
	surname = "Yatsenyuk";

	addstrings(&name, firstname, surname);
	return 0;
}

We basically have two strings of text that represent a first name and a surname (I was going to choose Putin but I decided that Yatsenyuk is a good guy at this point in time and I feel quite positive writing this).

Important note here: a string is an array of characters terminated with a null character. Each character is one byte; plain ASCII only uses the values 0 to 127, and whether a char holds -128 to 127 or 0 to 255 depends on whether the compiler treats it as signed or unsigned. This means we just fill up a character, or an array of characters, with numbers that correspond to text (or escape characters, such as the termination character).
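
To make the terminator a bit more concrete, here is a minimal sketch of counting a string's characters by walking along until the null character turns up, which is essentially what strlen does. The helper names count_until_terminator and bytes_needed are made up for this illustration and are not part of the code in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts characters up to, but not including, the
   '\0' terminator. */
size_t count_until_terminator(const char *text)
{
    size_t n = 0;
    assert(text != NULL);
    while (text[n] != '\0') {
        n++;
    }
    return n;
}

/* Hypothetical helper: bytes needed to store text, terminator included. */
size_t bytes_needed(const char *text)
{
    return count_until_terminator(text) + 1;
}
```

Calling count_until_terminator("Arseniy ") should give the same answer as strlen("Arseniy "), and bytes_needed adds the one extra byte that is so easy to forget.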

Oh yeah, we need to declare the three strings too. If we declare them outside of the main function, we can use them everywhere in our program:

char * name;
char * firstname;
char * surname;

So what we do here is declare three character pointers, each of which can represent a string. What does this do? It sets aside a variable that holds the memory address of a character (which is a single byte of data); no memory for the text itself has been allocated yet. Why is this important? Because this post relies on memory addresses quite a lot, and understanding how pointers work is really important in this basic example. Or rather, this basic example should be able to demonstrate using pointers a bit more.

If we then look back at what we are doing by giving a value to “firstname”, we are actually not giving it a single value (which would be written using a single-quote, for example ‘a’ or ‘H’) but a string of values.

The string literal in this statement is an array of characters of the correct length, plus one, because the compiler adds a null-termination character (‘\0’) onto the end. The array is therefore one character longer than you would think it should be! The compiler stores this array in memory for us (often read-only memory) and, because the variable “firstname” is a pointer, what we actually get back is just that: a pointer to the memory location where this array of characters is stored. So “firstname” does not store any text; it stores the memory address of where the character array actually resides in memory.

So (firstname = “Arseniy “) is an assignment whose result is the address of the “A”. To access any letter, we can say “firstname[3]” to access the “e”, or we can say “*(firstname + 3)”.

Now hold up a minute, you may rightly say. Adding to an array, what?! Well, that would be wrong entirely; you are not adding to an array, because “firstname” is not an array. It is a pointer to a location in memory. So what you are actually doing is referring to a place in memory that is three characters further along than “firstname”, which is exactly the place “firstname[3]” refers to as well. Dereference that address with an asterisk and we receive the value stored there, which is “e”.
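
As a small sketch of that equivalence, the two functions below read the same character, once with array notation and once with pointer arithmetic plus an asterisk. Both function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reads a character using array notation. */
char char_by_index(const char *text, size_t i)
{
    assert(text != NULL);
    return text[i];
}

/* Hypothetical helper: reads the same character using pointer arithmetic.
   text + i is only an address; the leading * fetches the char stored there. */
char char_by_offset(const char *text, size_t i)
{
    assert(text != NULL);
    return *(text + i);
}
```

With “Arseniy ” and an index of 3, both should hand back the “e”.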

Alternatively we could initialise both of these as arrays in the first place.

char firstname[] = "Arseniy ";
char surname[16] = "Yatsenyuk";

These are analogous to doing the following:

char firstname[9] = {'A', 'r', 's', 'e', 'n', 'i', 'y', ' ', '\0'};
char surname[16] = {'Y', 'a', 't', 's', 'e', 'n', 'y', 'u', 'k', 0, 0, 0, 0, 0, 0, 0 };

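There is a difference here. If we give the array no size, it will be sized as the length of the string we give it plus one (for the termination character). If we specify an array size ourselves, the string is copied in and any remaining elements are set to 0, so the first of those zeros acts as the terminator. The catch is when the size we give is exactly the number of visible characters: C then leaves no room for a terminator and stores the characters without one (C++ refuses to compile it). Sometimes we might not want the termination character; indeed, when you are combining strings, you do not want to copy the first string’s terminator into the middle of the result!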
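
A quick way to check what the compiler really stores is to compare sizeof (the size of the array) with strlen (the length of the text). This is only a sketch: the arrays are renamed firstname_array and surname_array so they don't clash with the pointers used in the rest of this post, and is_terminated is a made-up helper.

```c
#include <assert.h>
#include <string.h>

/* Size left to the compiler: 8 visible characters plus the '\0'. */
char firstname_array[] = "Arseniy ";

/* Size given explicitly: 9 characters, then 7 bytes of zero padding,
   the first of which acts as the terminator. */
char surname_array[16] = "Yatsenyuk";

/* Hypothetical helper: 1 if a '\0' exists within the first size bytes. */
int is_terminated(const char *buffer, size_t size)
{
    assert(buffer != NULL);
    return memchr(buffer, '\0', size) != NULL;
}
```

Here sizeof(firstname_array) is 9 while strlen(firstname_array) is 8, and sizeof(surname_array) is 16 while strlen(surname_array) is 9.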

However, we don’t necessarily know what we are going to do with the character arrays, especially in the case of “name”, which will combine the two. We won’t know until runtime how big this will be, so we have to allocate memory on the heap. This is memory for dynamically allocated resources and, because we have no idea which memory we will be given at runtime (it changes depending on system availability), we need to use pointers to store the address of the memory as it is allocated.

(It is worth noting that arrays of characters passed as parameters to functions are effectively passed by reference. This is important, because any changes you make within that function will affect the original data being passed in. To guard against this, we can specify the parameter as const char * rather than simply char *. What actually gets passed is not the array but a pointer to its first element; even if an array size is specified in the function signature, it is just ignored.)
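
Here is a rough sketch of both halves of that note. The first function takes an “array” parameter and scribbles on the caller's data; the second takes a const char * and can only read. Both function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the [16] is ignored, text is really a char *.
   Writing through it changes the caller's array. */
void blank_first_char(char text[16])
{
    assert(text != NULL);
    text[0] = '_';
}

/* Hypothetical helper: const lets us read but not write, so a line such as
   text[0] = '_'; would be rejected by the compiler in here. */
size_t count_spaces(const char *text)
{
    size_t spaces = 0;
    assert(text != NULL);
    for (; *text != '\0'; text++) {
        if (*text == ' ') {
            spaces++;
        }
    }
    return spaces;
}
```

If you declare char local[16] = "Arseniy "; and call blank_first_char(local), the array in the caller now starts with an underscore, even though nothing was returned.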

The function

Ok so now we have our character pointers made, let’s look at that actual “addstrings” function. Warning: anyone who really really knows C or C++ will hate how I have written this.

void addstrings(char ** result, char * firstpart, char * secondpart)
{
	int x = strlen(firstpart);
	int y = strlen(secondpart);

	*result = (char *)malloc((x + y + 1) * sizeof(char));

	memcpy(*result,			firstpart,		x);
	memcpy(*result + x,		secondpart,		y);

	*result[x + y] = '\0';
}

What we do here is:

  • Take a pointer to a character pointer (a pointer to a pointer, yes)
  • Take a pointer to a character, twice
  • Calculate the length of the two strings those character pointers point to
  • Allocate new memory and assign its address to the dereferenced pointer, which is where we are going to store our final string
  • Copy the contents of the first string into this new memory for its entire length
  • Copy the contents of the second string as well, but starting at an offset of the first one’s length, otherwise we would just overwrite what we already put into the new memory
  • Finally, we add a null terminating character onto the end

So it isn’t terribly complex if you can just remember that all a pointer is, is a small piece of data (4 bytes in a 32-bit build, 8 bytes in a 64-bit one) that contains the address of memory where some data resides. Then, when working with pointers, we can copy addresses and the contents of addresses over.

Pointers to pointers

Let’s break this down a bit further. The first thing we have is our “double” pointer. The reason for this is that the pointer we call “name” has not been given anything to point to. As a global variable it starts out holding the address 0, which means it points at nothing at all. So we want to initialise it. We can’t just pass it as we do the other strings, because they already hold a memory address and the function only receives a copy of that address. So we have to pass the address of the pointer itself (which is done with the ampersand symbol), and the function receives that as a pointer to a pointer.
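
To see why a plain pointer parameter is not enough, here is a minimal sketch with two made-up functions. One receives the pointer by value and the other receives the address of the pointer; greeting is just some sample text for them to point at.

```c
#include <assert.h>
#include <stddef.h>

char greeting[] = "hello";

/* Hypothetical helper: receives a COPY of the caller's pointer, so the
   assignment below is lost when the function returns. */
void set_by_value(char *target)
{
    target = greeting;
    (void)target; /* silence "set but not used" warnings */
}

/* Hypothetical helper: receives the ADDRESS of the caller's pointer, so
   *target is the caller's own variable and the assignment sticks. */
void set_by_address(char **target)
{
    assert(target != NULL);
    *target = greeting;
}
```

Starting with char *p = NULL;, a call to set_by_value(p) leaves p at NULL, whereas set_by_address(&p) leaves p pointing at greeting.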

Dereferencing

Now we want to dereference it. This means that, when we modify what the parameter refers to, we modify the caller’s variable (just like a pointer because.. well.. it is). In this case, we have to say “*result”; the asterisk as a prefix acts as the dereferencing operator. So, instead of the address of the pointer that was passed in, we have access to the value stored in the actual pointer.. which is 0.

Malloc

But that is fine, because we are going to give it a value. And this value is the memory address of a brand new block of memory (uninitialised, of course) that is big enough to hold the first set of characters, the second set of characters and a terminating character we will manually add (so x, y and 1). These numbers added together tell malloc how many bytes to allocate; if we were allocating ints instead, we would need that many ints multiplied by how many bytes an int occupies. This is why I added sizeof(char): it is always 1, but it emphasises that you would use sizeof() to get the size of a datatype (note that if you were to do sizeof on an array, it will return the size of the array and not the length of a string of characters, which could be much shorter).
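
For the int case mentioned above, a sketch might look like the following. The function name make_sequence is an assumption made up for this example, and the cast is kept only to match the style used in this post.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocates room for count ints and fills them with
   0, 1, 2, ... Returns NULL if malloc fails. The caller must free() it. */
int *make_sequence(size_t count)
{
    size_t i;
    int *numbers;
    assert(count > 0);

    /* count elements, each sizeof(int) bytes wide */
    numbers = (int *)malloc(count * sizeof(int));
    if (numbers == NULL) {
        return NULL;
    }
    for (i = 0; i < count; i++) {
        numbers[i] = (int)i;
    }
    return numbers;
}
```

Asking for 4 ints requests 4 * sizeof(int) bytes, not 4 bytes, which is the whole point of multiplying by sizeof.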

It is at this point that I should point out a difference between C and C++.

In C, you don’t actually need to cast the pointer that is returned by malloc to a char *. Malloc returns a void pointer, and C converts that to any other pointer type automatically. I personally liked to add the cast to get exactly what I want, but apparently this is just something that people don’t like to see. Ignoring the supposed repetition and clutter you get from casting a void pointer to a character pointer (and really, a pointer is a pointer.. the only thing the data type does is to signify how big an element is that the pointer points to), there is a genuine concern. If you forget to include stdlib.h, an older C compiler has no declaration for malloc and assumes it returns an int. Without the cast, assigning that int to a pointer produces a warning that hints at the missing include. If you explicitly cast it to (char *), you hide that warning, because the cast tells the compiler the conversion is intentional.

Let’s try this out: take out the explicit cast, take out stdlib.h as well, and see what happens:

mallocerror

Ok, so the first problem is that malloc isn’t even found. Hmm. Let’s add stdlib.h back in and see what happens.

mallocerror2

An error, still? I thought we could rely on implicit conversion! But wait: we are still using the C++ compiler, and C++ does not convert a void pointer implicitly…

Compiling using C

Here we go: how to compile using C in Visual Studio. Under the project properties, we can go to the Advanced properties under C/C++. Here the “Compile As” option lets us tell it to just use the C compiler.

compile as C

And now the errors go away if we include stdlib.h, regardless of whether or not we add (char *) before malloc to convert the result to a char pointer. So let’s go back and get rid of stdlib.h:

stdlib

In the first instance, we get two warnings, but if we add a (char *) conversion, we only get one. Interestingly, we only get warnings rather than errors, and we are warned that malloc isn’t defined in either case. But in neither case are we told that stdlib.h is missing. Perhaps this is particular to Microsoft’s C compiler that ships with Visual Studio 2013 Update 4, but it shows how you can actually tell Visual Studio to use the C compiler instead of C++.

It is also worth mentioning that malloc isn’t favoured in C++. One difference is that memory allocated by malloc has to be released with free(), whereas in C++ you would use the “new” and “delete” keywords. You can still use malloc, but the standard library has a string type with functions for handling strings, so there isn’t really a reason not to use that instead, except in the case where you might use a string that isn’t a string (u_char isn’t considered a string..). Additionally, malloc simply allocates memory and memcpy copies memory, but C++’s new operator will also call a constructor. C++’s object-oriented functionality like this is useful, and we can just ask for a new string if we want to do the above. So there are some important differences to be aware of, but really, if you use C then there is no reason not to just use malloc without a cast, and if you use C++ then you can just use “string” types instead.

Anyhow, onto what happens with the code for a final, interesting, look at a problem!

The problem

Ok, so if we take this code and run it as we have made it above, it should work.

#include "stdafx.h"
#include <stdlib.h>
#include <string.h>

char * name;
char * firstname;
char * surname;

void addstrings(char ** result, char * firstpart, char * secondpart)
{
	int x = strlen(firstpart);
	int y = strlen(secondpart);


	*result = (char *) malloc((x + y + 1) * sizeof(char));

	memcpy(*result,		firstpart,	x);
	memcpy(*result + x, secondpart, y);
	  
	*result[x + y] = '\0';
}



int _tmain(int argc, _TCHAR* argv[])
{
	firstname = "Arseniy ";
	surname = "Yatsenyuk";

	addstrings(&name, firstname, surname);
	return 0;
}

The problem is that you get an exception for trying to write to memory address 0 when this line runs:

*result[x + y] = '\0';

Everything should work fine. Bizarrely, it doesn’t. Everywhere we have dereferenced “result” has worked fine – even examining the variables shows that result points to a pointer that contains the text we want. But we can’t seem to set anything.

p1

This had me going mental for absolutely ages. It made no sense as to what was going on. However, the first hint as to what it could be came when I just changed the array index that I was trying to alter. If I set it to 0, it cleared the value of result (and, subsequently, the global variable it pointed to). If I set it to 1, something surprising happened.

p2

Still an error, but the address was no longer 0x00000000. On inspection, it turns out that the value at position 1 is the memory address held by “firstpart” (which is “firstname”). Sure enough, position 2 holds the memory address of “secondpart”. Presumably this is because the three global pointers happen to sit next to each other in memory, so indexing past “name” lands on its neighbours.

p3

In fact, setting result[2] to ‘\0’, without the dereference, will wipe the secondpart pointer entirely (this is exactly why const char pointers are important, to avoid accidentally doing what I just did here!). So what exactly is *result[2] doing?

It turns out that this is to do with the order in which the statement is evaluated. All along, the assumption has been that these two statements were the same thing:

*result[x + y] = '\0';	
(* result) [x + y] = '\0';

When actually, the following is the case:

*(result[x + y]) = '\0';

 

And what this means is that, all along, we have been indexing the pointer-to-pointer first and then dereferencing whatever pointer we found at that position. The only valid index is result[0], as this is the same thing as *result (the “name” pointer); result[1] is whatever happens to sit in memory after “name”, and result[x + y] is further away still. What we actually wanted to do was to dereference the pointer-pointer first and then access a location relative to that.
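
The two readings can be compared safely if there really are several pointers sitting side by side. The sketch below uses a made-up two-element array called words for that purpose, so indexing the outer pointer is legal here, unlike in my function. The function names are invented too.

```c
#include <assert.h>
#include <stddef.h>

/* Two pointers stored next to each other, so list[0] and list[1] both exist. */
const char *words[2] = { "alpha", "beta" };

/* Parsed as *(list[i]): pick the i-th POINTER, then read its first char. */
char without_brackets(const char **list, size_t i)
{
    assert(list != NULL);
    return *list[i];
}

/* Parsed as written: follow the FIRST pointer, then read its i-th char. */
char with_brackets(const char **list, size_t i)
{
    assert(list != NULL);
    return (*list)[i];
}
```

With an index of 1, without_brackets(words, 1) gives the “b” of “beta”, while with_brackets(words, 1) gives the “l” of “alpha”. Only at index 0 do the two agree.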

The lesson is:

It turns out that

*result[x + y] = '\0';

Is actually saying:

* (result[x + y]) = '\0';

And therefore, I have to explicitly state:

(* result) [x + y] = '\0';

Array notation ([]) takes precedence over dereferencing (*), so the index is applied before the asterisk.
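
Putting the fix in context, one possible corrected version of the function is sketched below. It is not the only way to write it: the name addstrings_fixed, the const parameters, the int return value and the NULL check after malloc are all my own additions for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical corrected variant of addstrings. Returns 0 on success and
   -1 if malloc fails; on success the caller owns *result and must free it. */
int addstrings_fixed(char **result, const char *firstpart, const char *secondpart)
{
    size_t x, y;
    assert(result != NULL && firstpart != NULL && secondpart != NULL);

    x = strlen(firstpart);
    y = strlen(secondpart);

    *result = (char *)malloc(x + y + 1);    /* +1 for the terminator */
    if (*result == NULL) {
        return -1;
    }

    memcpy(*result,     firstpart,  x);
    memcpy(*result + x, secondpart, y);
    (*result)[x + y] = '\0';                /* brackets make * apply first */
    return 0;
}
```

Called as addstrings_fixed(&name, firstname, surname), this should leave “name” pointing at “Arseniy Yatsenyuk”, and a matching free(name) hands the memory back once we are done with it.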

And there you have it. If this ever happens to you, make sure that you use brackets to contain whatever you are trying to say or do in this way. If anyone has any questions or issues with the code and statements above, feel free to leave them in a comment below!
