
Thursday, April 25, 2024

Converting Plain Text To Encoded HTML With Vanilla JavaScript

 What do you do when you need to convert plain text into formatted HTML? Perhaps you reach for Markdown or manually write in the element tags yourself. Or maybe you have one or two of the dozens of online tools that will do it for you. In this tutorial, Alexis Kypridemos picks those tools apart and details the steps for how we can do it ourselves with a little vanilla HTML, CSS, and JavaScript.

When copying text from a website to your device’s clipboard, there’s a good chance that you will get the formatted HTML when pasting it. Some apps and operating systems have a “Paste Special” feature that will strip those tags out for you to maintain the current style, but what do you do if that’s unavailable?

Same goes for converting plain text into formatted HTML. One of the closest ways we can convert plain text into HTML is writing in Markdown as an abstraction. You may have seen examples of this in many comment forms in articles just like this one. Write the comment in Markdown and it is parsed as HTML.

Even better would be no abstraction at all! You may have also seen (and used) a number of online tools that take plainly written text and convert it into formatted HTML. The UI makes the conversion and previews the formatted result in real time.

Providing a way for users to author basic web content, like comments, without knowing even the first thing about HTML is a noble pursuit, as it lowers barriers to communicating and collaborating on the web. Saying it helps “democratize” the web may be heavy-handed, but it doesn’t conflict with that vision!

Smashing Magazine’s comment form includes instructions for formatting a comment in Markdown syntax.

We can build a tool like this ourselves. I’m all for using existing resources where possible, but I’m also for demonstrating how these things work and maybe learning something new in the process.

Defining The Scope

There are plenty of assumptions and considerations that could go into a plain-text-to-HTML converter. For example, should we assume that the first line of text entered into the tool is a title that needs corresponding <h1> tags? Is each new line truly a paragraph, and how does linking content fit into this?

Again, the idea is that a user should be able to write without knowing Markdown or HTML syntax. This is a big constraint, and there are far too many HTML elements we might encounter, so it’s worth knowing the context in which the content is being used. For example, if this is a tool for writing blog posts, then we can limit the scope of which elements are supported based on those that are commonly used in long-form content: <h1>, <p>, <a>, and <img>. In other words, it will be possible to include top-level headings, body text, linked text, and images. There will be no support for bulleted or ordered lists, tables, or any other elements for this particular tool.

The front-end implementation will rely on vanilla HTML, CSS, and JavaScript to establish a small form with a simple layout and functionality that converts the text to HTML. There is a server-side aspect to this if you plan on deploying it to a production environment, but our focus is purely on the front end.

Looking At Existing Solutions

There are existing ways to accomplish this. For example, some libraries offer a WYSIWYG editor. Import a library like TinyMCE with a single <script> and you’re good to go. WYSIWYG editors are powerful and support all kinds of formatting, even applying CSS classes to content for styling.

But TinyMCE isn’t the most efficient package at about 500 KB minified. That’s not a criticism as much as an indication of how much functionality it covers. We want something more “barebones” than that for our simple purpose. Searching GitHub surfaces more possibilities. The solutions, however, seem to fall into one of two categories:

  • The input accepts plain text, but the generated HTML only supports the HTML <h1> and <p> tags.
  • The input converts plain text into formatted HTML, but by “plain text,” the tool seems to mean “Markdown” (or a variety of it) instead. The txt2html Perl module (from 1994!) would fall under this category.

Even if a perfect solution for what we want was already out there, I’d still want to pick apart the concept of converting text to HTML to understand how it works and hopefully learn something new in the process. So, let’s proceed with our own homespun solution.

Setting Up The HTML

We’ll start with the HTML structure for the input and output. For the input element, we’re probably best off using a <textarea>. For the output element and related styling, choices abound. The following is merely one example with some very basic CSS to place the input <textarea> on the left and an output <div> on the right:

See the Pen Base Form Styles [forked] by Geoff Graham.

You can further develop the CSS, but that isn’t the focus of this article. There is no question that the design can be prettier than what I am providing here!

Capture The Plain Text Input

We’ll set an onkeyup event handler on the <textarea> to call a JavaScript function called convert() that does what it says: convert the plain text into HTML. The conversion function should accept one parameter, a string, for the user’s plain text input entered into the <textarea> element:

<textarea onkeyup='convert(this.value);'></textarea>

onkeyup is a better choice than onkeydown in this case, as onkeyup will call the conversion function after the user completes each keystroke, as opposed to before it happens. This way, the output, which is refreshed with each keystroke, always includes the latest typed character. If the conversion is triggered with an onkeydown handler, the output will exclude the most recent character the user typed. This can be frustrating when, for example, the user has finished typing a sentence but cannot yet see the final punctuation mark, say a period (.), in the output until typing another character first. This creates the impression of a typo, glitch, or lag when there is none.
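One caveat: keyup only fires for keyboard activity, so text that is pasted or cut with the mouse will not trigger a conversion until the next key press. The input event covers those cases as well. Here is a rough sketch of attaching the handler in JavaScript rather than through an inline attribute. The wire_converter() helper and its parameter names are made up for illustration and are not part of the demo:

```javascript
// Hypothetical helper: re-run a conversion whenever the field's value changes.
// input_el:   the <textarea> (anything with addEventListener and a value)
// convert_fn: a function that receives the current text
function wire_converter(input_el, convert_fn) {
  // "input" fires after typing, pasting, cutting, and drag-and-drop edits
  input_el.addEventListener("input", () => convert_fn(input_el.value));
}

// In the browser, assuming the page contains a single <textarea>:
// wire_converter(document.querySelector("textarea"), convert);
```

Passing the element and the conversion function in as arguments keeps the helper free of hard-coded selectors, which also makes it easy to exercise without a real page.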

In JavaScript, the convert() function has the following responsibilities:

  1. Encode the input in HTML.
  2. Process the input line-by-line and wrap each individual line in either a <h1> or <p> HTML tag, whichever is most appropriate.
  3. Process the output of the transformations as a single string, wrap URLs in HTML <a> tags, and replace image file names with <img> elements.

And from there, we display the output. We can create separate functions for each responsibility. Let’s name them accordingly:

  1. html_encode()
  2. convert_text_to_HTML()
  3. convert_images_and_links_to_HTML()

Each function accepts one parameter, a string, and returns a string.

Encoding The Input Into HTML

Use the html_encode() function to HTML encode/sanitize the input. HTML encoding refers to the process of escaping or replacing certain characters in a string input to prevent users from inserting their own HTML into the output. At a minimum, we should replace the following characters:

  • < with &lt;
  • > with &gt;
  • & with &amp;
  • ' with &#39;
  • " with &quot;

JavaScript does not provide a built-in way to HTML encode input as other languages do. For example, PHP has htmlspecialchars(), htmlentities(), and strip_tags() functions. That said, it is relatively easy to write our own function that does this, which is the job of the html_encode() function we named earlier:

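function html_encode(input) {
  // A detached <textarea> lets the browser do the escaping for us
  const textArea = document.createElement("textarea");
  // Assigning innerText stores the input as text, never as markup
  textArea.innerText = input;
  // Reading innerHTML returns that text with <, >, and & escaped (quotes are left as-is).
  // innerText turned line breaks into <br>, so restore them as \n
  return textArea.innerHTML.split("<br>").join("\n");
}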
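For comparison, here is a minimal sketch of a string-only encoder that replaces all five characters listed above in a single pass. The name html_encode_string() and the HTML_ESCAPES lookup table are illustrative assumptions, not part of the demo. Because it also escapes quotes, the regular expressions used later in this article would see &quot; and &#39; instead of the raw characters, so treat it as a starting point rather than a drop-in replacement.

```javascript
// Hypothetical string-only variant of html_encode(); it needs no DOM,
// so it can also run outside the browser.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

function html_encode_string(input) {
  // One pass over the input, so the "&" in an entity we just wrote is never re-escaped
  return input.replace(/[&<>"']/g, (character) => HTML_ESCAPES[character]);
}

console.log(html_encode_string(`<b class="x">Tom & Jerry's</b>`));
// &lt;b class=&quot;x&quot;&gt;Tom &amp; Jerry&#39;s&lt;/b&gt;
```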

HTML encoding of the input is a critical security consideration. It prevents unwanted scripts or other HTML manipulations from getting injected into our work. Granted, front-end input sanitization and validation are both merely deterrents because bad actors can bypass them. But we may as well make them work a little harder.

As long as we are on the topic of securing our work, make sure to HTML-encode the input on the back end, where the user cannot interfere. At the same time, take care not to encode the input more than once. Encoding text that is already HTML-encoded will break the output functionality. The best approach for back-end storage is for the front end to pass the raw, unencoded input to the back end, then ask the back end to HTML-encode the input before inserting it into a database.

That said, this only accounts for sanitizing and storing the input on the back end. We still have to display the encoded HTML output on the front end. There are at least two approaches to consider:

  1. Convert the input to HTML after HTML-encoding it and before it is inserted into a database.
    This is efficient, as the input only needs to be converted once. However, this is also an inflexible approach, as updating the HTML becomes difficult if the output requirements happen to change in the future.
  2. Store only the HTML-encoded input text in the database and dynamically convert it to HTML before displaying the output for each content request.
    This is less efficient, as the conversion will occur on each request. However, it is also more flexible since it’s possible to update how the input text is converted to HTML if requirements change.

Applying Semantic HTML Tags

Let’s use the convert_text_to_HTML() function we named earlier to wrap each line in its respective HTML tag, which is going to be either <h1> or <p>. To determine which tag to use, we will split the text input on the newline character (\n) so that the text is processed as an array of lines rather than a single string, allowing us to evaluate them individually.

function convert_text_to_HTML(txt) {
  // Output variable
  let out = '';
  // Split text at the newline character into an array
  const txt_array = txt.split("\n");
  // Get the number of lines in the array
  const txt_array_length = txt_array.length;
  // Variable to keep track of the (non-blank) line number
  let non_blank_line_count = 0;
  
  for (let i = 0; i < txt_array_length; i++) {
    // Get the current line
    const line = txt_array[i];
    // Skip lines that are empty or contain only whitespace
    if (line.trim() === ''){
      continue;
    }
    
    non_blank_line_count++;
    // If a line is the first line that contains text
    if (non_blank_line_count === 1){
      // ...wrap the line of text in a Heading 1 tag
      out += `<h1>${line}</h1>`;
      // ...otherwise, wrap the line of text in a Paragraph tag.
    } else {
      out += `<p>${line}</p>`;
    }
  }

  return out;
}

In short, this little snippet loops through the array of split text lines and ignores lines that do not contain any text characters. From there, we can evaluate whether a line is the first one in the series. If it is, we slap a <h1> tag on it; otherwise, we mark it up in a <p> tag.

This logic could be used to account for other types of elements that you may want to include in the output. For example, perhaps the second line is assumed to be a byline that names the author and links up to an archive of all author posts.
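To make that idea concrete, here is a hedged sketch of how the byline rule might look. Treating the second non-blank line as a byline and marking it up with <address> are both assumptions made for this example, as is the function name convert_text_with_byline():

```javascript
// Sketch only: the "second non-blank line is a byline" rule is an assumption
// made for this example, as is the choice of <address> to mark it up.
function convert_text_with_byline(txt) {
  let out = "";
  let non_blank_line_count = 0;

  for (const line of txt.split("\n")) {
    // Skip lines that are empty or contain only whitespace
    if (line.trim() === "") {
      continue;
    }

    non_blank_line_count++;
    if (non_blank_line_count === 1) {
      out += `<h1>${line}</h1>`;
    } else if (non_blank_line_count === 2) {
      out += `<address>${line}</address>`;
    } else {
      out += `<p>${line}</p>`;
    }
  }

  return out;
}

console.log(convert_text_with_byline("My Title\n\nBy Jane Doe\nFirst paragraph."));
// <h1>My Title</h1><address>By Jane Doe</address><p>First paragraph.</p>
```

Linking the byline to an author archive would need data this function does not have, so that part is left out.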

Tagging URLs And Images With Regular Expressions

Next, we’re going to create our convert_images_and_links_to_HTML() function to encode URLs and images as HTML elements. It’s a good chunk of code, so I’ll drop it in and we’ll immediately start picking it apart together to explain how it all works.


function convert_images_and_links_to_HTML(string){
  let urls_unique = [];
  let images_unique = [];
  const urls = string.match(/https*:\/\/[^\s<),]+[^\s<),.]/gmi) ?? [];
  const imgs = string.match(/[^"'>\s]+\.(jpg|jpeg|gif|png|webp)/gmi) ?? [];
                          
  const urls_length = urls.length;
  const images_length = imgs.length;
  
  for (let i = 0; i < urls_length; i++){
    const url = urls[i];
    if (!urls_unique.includes(url)){
      urls_unique.push(url);
    }
  }
  
  for (let i = 0; i < images_length; i++){
    const img = imgs[i];
    if (!images_unique.includes(img)){
      images_unique.push(img);
    }
  }
  
  const urls_unique_length = urls_unique.length;
  const images_unique_length = images_unique.length;
  
  for (let i = 0; i < urls_unique_length; i++){
    const url = urls_unique[i];
    if (images_unique_length === 0 || !images_unique.includes(url)){
      const a_tag = `<a href="${url}" target="_blank">${url}</a>`;
      string = string.replace(url, a_tag);
    }
  }
  
  for (let i = 0; i < images_unique_length; i++){
    const img = images_unique[i];
    const img_tag = `<img src="${img}" alt="">`;
    const img_link = `<a href="${img}" target="_blank">${img_tag}</a>`;
    string = string.replace(img, img_link);
  }
  return string;
}

Unlike the convert_text_to_HTML() function, here we use regular expressions to identify the terms that need to be wrapped and/or replaced with <a> or <img> tags. We do this for a couple of reasons:

  1. The previous convert_text_to_HTML() function handles text that would be transformed to the HTML block-level elements <h1> and <p>, and, if you want, other block-level elements such as <address>. Block-level elements in the HTML output correspond to discrete lines of text in the input, which you can think of as paragraphs, the text entered between presses of the Enter key.
  2. On the other hand, URLs in the text input are often included in the middle of a sentence rather than on a separate line. Images that occur in the input text are often included on a separate line, but not always. While you could identify text that represents URLs and images by processing the input line-by-line — or even word-by-word, if necessary — it is easier to use regular expressions and process the entire input as a single string rather than by individual lines.

Regular expressions, though they are powerful and the appropriate tool to use for this job, come with a performance cost, which is another reason to use each expression only once for the entire text input.

Remember: All the JavaScript in this example runs each time the user types a character, so it is important to keep things as lightweight and efficient as possible.
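If the conversion ever becomes noticeably slow on long inputs, one common mitigation is to debounce it so that it runs only after the user pauses. The following is a minimal sketch, not something the demo does. The debounce() helper and the 200-millisecond delay in the usage comment are assumptions for illustration:

```javascript
// Hypothetical helper: postpone calls to fn until wait_ms have passed
// without another call, then run fn once with the latest arguments.
function debounce(fn, wait_ms) {
  let timer_id;
  return (...args) => {
    // Cancel the run scheduled by the previous call, if it has not fired yet
    clearTimeout(timer_id);
    timer_id = setTimeout(() => fn(...args), wait_ms);
  };
}

// Possible usage with the convert() function from this article:
// const convert_debounced = debounce(convert, 200);
// <textarea onkeyup='convert_debounced(this.value);'></textarea>
```

The trade-off is a short delay before the preview updates, so it is only worth adding if typing actually feels sluggish.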

I also want to make a note about the variable names in our convert_images_and_links_to_HTML() function. images (plural), image (singular), and link are not reserved words in the JavaScript language itself, which is why they do not appear on the relevant MDN page. W3Schools, however, lists them among the names of HTML and window objects and properties that are best avoided as identifiers, so imgs, img, and a_tag were used for naming instead.

We’re using the String.prototype.match() function for each of the two regular expressions, then storing the results for each call in an array. From there, we use the nullish coalescing operator (??) on each call so that, if no matches are found, the result will be an empty array. If we do not do this and no matches are found, the result of each match() call will be null and will cause problems downstream.

const urls = string.match(/https*:\/\/[^\s<),]+[^\s<),.]/gmi) ?? [];
const imgs = string.match(/[^"'>\s]+\.(jpg|jpeg|gif|png|webp)/gmi) ?? [];
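Here is a quick illustration of why that fallback matters, using the same URL pattern on two made-up sentences. The variable names are only for this example:

```javascript
const url_pattern = /https*:\/\/[^\s<),]+[^\s<),.]/gmi;

// With the g flag, match() returns an array holding every match...
const found = "Visit https://example.com/docs, then reply.".match(url_pattern) ?? [];
console.log(found.length, found[0]); // 1 https://example.com/docs

// ...but it returns null, not an empty array, when nothing matches
const raw = "No links here.".match(url_pattern);
console.log(raw); // null

// The ?? fallback keeps .length and loops safe
const none = raw ?? [];
console.log(none.length); // 0
```

Note how the trailing comma after the URL is left out of the match, thanks to the character classes at the end of the pattern.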

Next up, we filter the arrays of results so that each array contains only unique results. This is a critical step. If we don’t filter out duplicate results and the input text contains multiple instances of the same URL or image file name, then we break the HTML tags in the output. JavaScript arrays have no dedicated method for this that’s akin to the PHP array_unique() function, so we have to build the filtering ourselves.

The code snippet works around this limitation using an admittedly ugly but straightforward procedural approach. The same problem can be solved using a more functional approach if you prefer. There are many articles on the web describing various ways to filter a JavaScript array in order to keep only the unique items.
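One such alternative, sketched here on the assumption that your target browsers support ES2015, passes the array through a Set. The helper name unique_items() is hypothetical:

```javascript
// A Set keeps only distinct values and preserves first-seen order,
// so spreading it back into an array removes the duplicates.
function unique_items(items) {
  return [...new Set(items)];
}

const sample_urls = ["https://a.example", "https://b.example", "https://a.example"];
console.log(unique_items(sample_urls).join(", "));
// https://a.example, https://b.example
```

It does the same job as the two deduplication loops in the function above, just with less code.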

We’re also checking if the URL is matched as an image before replacing a URL with an appropriate <a> tag and performing the replacement only if the URL doesn’t match an image. We may be able to avoid having to perform this check by using a more intricate regular expression. The example code deliberately uses regular expressions that are perhaps less precise but hopefully easier to understand in an effort to keep things as simple as possible.
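As a rough idea of what that could look like, the sketch below combines both patterns into one and lets String.prototype.replace() call a function for every match. Everything here is an assumption for illustration rather than the article’s implementation: the combined pattern, the IMAGE_EXTENSION check, the function name, and the use of https? in place of https*.

```javascript
// Sketch: one pattern, one pass over the text.
const IMAGE_EXTENSION = /\.(jpg|jpeg|gif|png|webp)$/i;
const URL_OR_IMAGE = /https?:\/\/[^\s<),]+[^\s<),.]|[^"'>\s]+\.(?:jpg|jpeg|gif|png|webp)/gi;

function link_urls_and_images(string) {
  // The callback runs once per match, left to right, on the original text,
  // so repeated URLs are all handled and inserted tags are never re-scanned
  return string.replace(URL_OR_IMAGE, (match) => {
    if (IMAGE_EXTENSION.test(match)) {
      const img_tag = `<img src="${match}" alt="">`;
      return `<a href="${match}" target="_blank">${img_tag}</a>`;
    }
    return `<a href="${match}" target="_blank">${match}</a>`;
  });
}

console.log(link_urls_and_images("Docs at https://example.com and https://example.com, logo.png"));
```

With this input, both occurrences of the URL become links and logo.png becomes a linked image. The trade-off is a longer, harder-to-read regular expression, which is exactly what the example code tries to avoid.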

And, finally, we’re replacing image file names in the input text with <img> tags that have the src attribute set to the image file name. For example, my_image.png in the input is transformed into <img src="my_image.png" alt=""> in the output. We wrap each <img> tag with an <a> tag that links to the image file and opens it in a new tab when clicked.

There are a couple of benefits to this approach:

  • In a real-world scenario, you will likely use a CSS rule to constrain the size of the rendered image. By making the images clickable, you provide users with a convenient way to view the full-size image.
  • If the image is not a local file but is instead a URL to an image from a third party, this is a way to implicitly provide attribution. Ideally, you should not rely solely on this method but, instead, provide explicit attribution underneath the image in a <figcaption>, <cite>, or similar element. But if, for whatever reason, you are unable to provide explicit attribution, you are at least providing a link to the image source.

It may go without saying, but “hotlinking” images is something to avoid. Use only locally hosted images wherever possible, and provide attribution if you do not hold the copyright for them.

Before we move on to displaying the converted output, let’s talk a bit about accessibility, specifically the image alt attribute. The example code I provided does add an alt attribute in the conversion but does not populate it with a value, as there is no easy way to automatically calculate what that value should be. An empty alt attribute can be acceptable if the image is considered “decorative,” i.e., purely supplementary to the surrounding text. But one may argue that there is no such thing as a purely decorative image.

That said, I consider this to be a limitation of what we’re building.

Displaying The Output HTML

We’re at the point where we can finally work on displaying the HTML-encoded output! We’ve already handled all the work of converting the text, so all we really need to do now is call it:

function convert(input_string) {
  output.innerHTML = convert_images_and_links_to_HTML(convert_text_to_HTML(html_encode(input_string)));
}

If you would rather display the output string as raw HTML markup, use a <pre> tag as the output element instead of a <div>:

<pre id='output'></pre>

The only thing to note about this approach is that you would target the <pre> element’s textContent instead of innerHTML:

function convert(input_string) {
  output.textContent = convert_images_and_links_to_HTML(convert_text_to_HTML(html_encode(input_string)));
}

Conclusion 

We did it! We built the same sort of copy-paste tool that converts plain text on the spot. In this case, we’ve configured it so that plain text entered into a <textarea> is parsed line-by-line and encoded into HTML that we format and display inside another element.

See the Pen Convert Plain Text to HTML (PoC) [forked] by Geoff Graham.

We were even able to keep the solution fairly simple, i.e., vanilla HTML, CSS, and JavaScript, without reaching for a third-party library or framework. Does this simple solution do everything a ready-made tool like a framework can do? Absolutely not. But a solution as simple as this is often all you need: nothing more and nothing less.

As far as scaling this further, the code could be modified to POST what’s entered into the <form> using a PHP script or the like. That would be a great exercise, and if you do it, please share your work with me in the comments because I’d love to check it out.
