How you can train AI to convert design mockups into HTML and CSS

Within three years, deep learning will change front-end development. It will increase prototyping speed and lower the barrier for building software.

Photo by Wesson Wang on Unsplash

Currently, the largest barrier to automating front-end development is computing power. However, we can use current deep learning algorithms, along with synthesized training data, to start exploring artificial front-end automation right now.

In this post, we’ll teach a neural network how to code a basic HTML and CSS website based on a picture of a design mockup using deep-learning platform FloydHub. Here’s a quick overview of the process:

1) Give a design image to the trained neural network

2) The neural network converts the image into HTML markup

3) Rendered output

We’ll build the neural network in three iterations.

First, we’ll make a bare minimum version to get the hang of the moving parts. The second version, HTML, will focus on automating all the steps and explaining the neural network layers. In the final version, Bootstrap, we’ll create a model that can generalize and explore the LSTM layer.

All the code is prepared on GitHub and FloydHub in Jupyter notebooks. All the FloydHub notebooks are inside the floydhub directory and the local equivalents are under local.

The models are based on Beltramelli’s pix2code paper and Jason Brownlee’s image caption tutorials. The code is written in Python and Keras, a framework on top of TensorFlow.

If you’re new to deep learning, I’d recommend getting a feel for Python, backpropagation, and convolutional neural networks. My three earlier posts on FloydHub’s blog will get you started:

Let’s recap our goal. We want to build a neural network that will generate HTML/CSS markup that corresponds to a screenshot.

When you train the neural network, you give it several screenshots with matching HTML.

It learns by predicting all the matching HTML markup tags one by one. When it predicts the next markup tag, it receives the screenshot as well as all the correct markup tags until that point.

Creating a model that predicts word by word is the most common approach today. There are other approaches, but that’s the method we’ll use throughout this tutorial.

Notice that for each prediction it gets the same screenshot. So if it has to predict 20 words, it will get the same design mockup twenty times. For now, don’t worry about how the neural network works. Focus on grasping the input and output of the neural network.

Let’s focus on the previous markup. Say we train the network to predict the sentence “I can code.” When it receives “I,” then it predicts “can.” Next time it will receive “I can” and predict “code.” It receives all the previous words and only has to predict the next word.
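As a rough illustration of the input and output described above, the training examples for one screenshot can be sketched in plain Python. The helper name make_training_pairs and the placeholder file name are assumptions made for this sketch; they are not part of the tutorial’s code.

```python
def make_training_pairs(screenshot, tokens):
    """Build ((screenshot, previous tokens), next token) examples."""
    pairs = []
    for i in range(1, len(tokens)):
        previous = tokens[:i]    # every token seen so far
        next_token = tokens[i]   # the single token to predict
        # The same screenshot is attached to every example
        pairs.append(((screenshot, previous), next_token))
    return pairs


# "screenshot.jpg" is a placeholder; a real pipeline would pass pixel data
pairs = make_training_pairs("screenshot.jpg", ["I", "can", "code"])
for (image, previous), target in pairs:
    print(image, previous, "->", target)
```

The sentence “I can code” yields two examples: “I” predicts “can”, and “I can” predicts “code”, each paired with the same screenshot.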

The neural network creates features from the data. The network builds features to link the input data with the output data. It has to create representations to understand what is in each screenshot and the HTML syntax it has predicted so far. This builds the knowledge to predict the next tag.

When you want to use the trained model for real-world usage, it’s similar to when you train the model. The text is generated one by one with the same screenshot each time. Instead of feeding it with the correct HTML tags, it receives the markup it has generated so far. Then, it predicts the next markup tag. The prediction is initiated with a “start tag” and stops when it predicts an “end tag” or reaches a max limit. Here’s another example in a Google Sheet.
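The generation loop can be sketched as follows. This is a minimal sketch that assumes a predict_next function standing in for the trained model; both generate_markup and toy_predict_next are hypothetical names used only for illustration.

```python
def generate_markup(predict_next, screenshot, max_length=10):
    """Generate tokens one by one until "end" or the length limit."""
    tokens = ["start"]
    while len(tokens) < max_length:
        # The model sees the same screenshot plus its own output so far
        next_token = predict_next(screenshot, tokens)
        tokens.append(next_token)
        if next_token == "end":
            break
    return tokens


def toy_predict_next(screenshot, tokens):
    """Toy stand-in for a trained model: follows a fixed script."""
    script = {
        "start": "<HTML>",
        "<HTML>": "Hello World!",
        "Hello World!": "</HTML>",
        "</HTML>": "end",
    }
    return script[tokens[-1]]


result = generate_markup(toy_predict_next, "screenshot.jpg")
print(" ".join(result))
```

The max_length guard matters in practice: a model that never predicts the end token would otherwise loop forever.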

“Hello World” Version

Let’s build a “hello world” version. We’ll feed a neural network a screenshot with a website displaying “Hello World!” and teach it to generate the markup.

First, the neural network maps the design mockup into a list of pixel values. Each value ranges from 0 to 255 in three channels: red, green, and blue.
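To make the pixel representation concrete, here is a small NumPy sketch. It builds a synthetic image instead of loading a real mockup, and it assumes the red, green, blue channel order that common image loaders return.

```python
import numpy as np

# A synthetic 224x224 "mockup" with three color channels (assumed RGB order)
mockup = np.zeros((224, 224, 3), dtype=np.uint8)
mockup[:, :, 0] = 255  # fill the red channel, giving a pure red image

print(mockup.shape)   # height, width, channels
print(mockup[0, 0])   # the three channel values of the top-left pixel
```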

To represent the markup in a way that the neural network understands, I use one-hot encoding. Thus, the sentence “I can code” could be mapped like the below.

In the above graphic, we include the start and end tag. These tags are cues for when the network starts its predictions and when to stop.
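A one-hot encoding of this sentence can be sketched in a few lines of Python. The order of the vocabulary below is an arbitrary assumption; any fixed order works as long as it stays the same during training and prediction.

```python
# Hypothetical vocabulary order, including the start and end tokens
vocabulary = ["start", "I", "can", "code", "end"]


def one_hot(token, vocabulary):
    """Return a vector of zeros with a single 1 at the token's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(token)] = 1
    return vector


sentence = ["start", "I", "can", "code", "end"]
encoded = [one_hot(token, vocabulary) for token in sentence]
for token, vector in zip(sentence, encoded):
    print(token, vector)
```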

For the input data, we will use sentences, starting with the first word and then adding each word one by one. The output data is always one word.

Sentences follow the same logic as words. They also need the same input length. Instead of being capped by the vocabulary, they are bound by the maximum sentence length. If a sentence is shorter than the maximum length, you fill it up with empty words, each a word with just zeros.

As you see, words are printed from right to left. This forces each word to change position for each training round. This allows the model to learn the sequence instead of memorizing the position of each word.
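The padding scheme can be sketched as a small helper that fills the left side with empty words, so the most recent word always sits in the rightmost slot. The name pad_left is an assumption made for this sketch.

```python
def pad_left(vectors, max_length, vocab_size):
    """Pad a list of one-hot vectors on the left with all-zero words."""
    empty = [0.0] * vocab_size
    vectors = vectors[-max_length:]  # keep only the most recent words
    padding = [list(empty) for _ in range(max_length - len(vectors))]
    return padding + vectors


start = [1.0, 0.0, 0.0]
html = [0.0, 1.0, 0.0]

first_input = pad_left([start], max_length=3, vocab_size=3)
second_input = pad_left([start, html], max_length=3, vocab_size=3)
print(first_input)
print(second_input)
```

Each word shifts one position to the left as the sentence grows, which is the movement described above.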

In the below graphic there are four predictions. Each row is one prediction. To the left are the images represented in their three color channels: red, green and blue and the previous words. Outside of the brackets are the predictions one by one, ending with a red square to mark the end.

green blocks = start tokens | red block = end token
# Imports assumed for the standalone Keras version used at the time
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.layers import Input, Dense, LSTM, RepeatVector, concatenate
from keras.models import Model
from keras.preprocessing.image import img_to_array, load_img

# Length of longest sentence
max_caption_len = 3
# Size of vocabulary
vocab_size = 3

# Load one screenshot for each word and turn them into digits
images = []
for i in range(2):
    images.append(img_to_array(load_img('screenshot.jpg', target_size=(224, 224))))
images = np.array(images, dtype=float)
# Preprocess input for the VGG16 model
images = preprocess_input(images)

# Turn start tokens into one-hot encoding
html_input = np.array(
    [[[0., 0., 0.],  # start
      [0., 0., 0.],
      [1., 0., 0.]],
     [[0., 0., 0.],  # start <HTML>Hello World!</HTML>
      [1., 0., 0.],
      [0., 1., 0.]]])

# Turn next word into one-hot encoding
next_words = np.array(
    [[0., 1., 0.],   # <HTML>Hello World!</HTML>
     [0., 0., 1.]])  # end

# Load the VGG16 model trained on imagenet and output the classification feature
VGG = VGG16(weights='imagenet', include_top=True)
# Extract the features from the image
features = VGG.predict(images)

# Load the feature to the network, apply a dense layer, and repeat the vector
vgg_feature = Input(shape=(1000,))
vgg_feature_dense = Dense(5)(vgg_feature)
vgg_feature_repeat = RepeatVector(max_caption_len)(vgg_feature_dense)
# Extract information from the input sequence
# (one row per position in the sentence, one column per vocabulary entry)
language_input = Input(shape=(max_caption_len, vocab_size))
language_model = LSTM(5, return_sequences=True)(language_input)

# Concatenate the information from the image and the input
decoder = concatenate([vgg_feature_repeat, language_model])
# Extract information from the concatenated output
decoder = LSTM(5, return_sequences=False)(decoder)
# Predict which word comes next
decoder_output = Dense(vocab_size, activation='softmax')(decoder)
# Compile and run the neural network
model = Model(inputs=[vgg_feature, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Train the neural network
model.fit([features, html_input], next_words, batch_size=2, shuffle=False, epochs=1000)

In the hello world version, we use three tokens: start, <HTML><center><H1>Hello World!</H1></center></HTML> and end. A token can be anything. It can be a character, word, or sentence. Character versions require a smaller vocabulary but constrain the neural network. Word-level tokens tend to perform best.

Here we make the prediction:

# Create an empty sentence and insert the start token
sentence = np.zeros((1, 3, 3))  # [[0,0,0], [0,0,0], [0,0,0]]
start_token = [1., 0., 0.]  # start
sentence[0][2] = start_token  # place start in empty sentence

# Making the first prediction with the start token
second_word = model.predict([np.array([features[1]]), sentence])

# Put the second word in the sentence and make the final prediction
sentence[0][1] = start_token
sentence[0][2] = np.round(second_word)
third_word = model.predict([np.array([features[1]]), sentence])

# Place the start token and our two predictions in the sentence
sentence[0][0] = start_token
sentence[0][1] = np.round(second_word)
sentence[0][2] = np.round(third_word)

# Transform our one-hot predictions into the final tokens
vocabulary = ["start", "<HTML><center><H1>Hello World!</H1></center></HTML>", "end"]
for i in sentence[0]:
    print(vocabulary[np.argmax(i)], end=' ')

Output after different numbers of training epochs:

  • 10 epochs: start start start
  • 100 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> <HTML><center><H1>Hello World!</H1></center></HTML>
  • 300 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> end

Mistakes I made:

  • Build the first working version before gathering the data. Early on in this project, I managed to get a copy of an old archive of the Geocities hosting website. It had 38 million websites. Blinded by the potential, I ignored the huge workload that would be required to reduce the 100K-sized vocabulary.
  • Dealing with a terabyte worth of data requires good hardware or a lot of patience. After having my Mac run into several problems, I ended up using a powerful remote server. Expect to rent a rig with 8 modern CPU cores and a 1 Gbps internet connection to have a decent workflow.
  • Nothing made sense until I understood the input and output data. The input, X, is one screenshot and the previous markup tags. The output, Y, is the next markup tag. When I got this, it became easier to understand everything between them. It also became easier to experiment with different architectures.
  • Be aware of the rabbit holes. Because this project intersects with a lot of fields in deep learning, I got stuck in plenty of rabbit holes along the way. I spent a week programming RNNs from scratch, got too fascinated by embedding vector spaces, and was seduced by exotic implementations.
  • Picture-to-code networks are image caption models in disguise. Even when I learned this, I still ignored many of the image caption papers, simply because they were less cool. Once I got some perspective, I accelerated my learning of the problem space.

Running the code on FloydHub

FloydHub is a training platform for deep learning. I came across them when I first started learning deep learning and I’ve used them since for training and managing my deep learning experiments. You can install it and run your first model within 10 minutes. It’s hands down the best option to run models on cloud GPUs.

Clone the repository

Login and initiate FloydHub command-line-tool

cd Screenshot-to-code-in-Keras
floyd login
floyd init s2c

Run a Jupyter notebook on a FloydHub cloud GPU machine:

floyd run --gpu --env tensorflow-1.4 --data emilwallner/datasets/imagetocode/2:data --mode jupyter

All the notebooks are prepared inside the floydhub directory. The local equivalents are under local. Once it’s running, you can find the first notebook here: floydhub/Helloworld/helloworld.ipynb.

If you want more detailed instructions and an explanation for the flags, check my earlier post.

HTML Version

In this version, we’ll automate many of the steps from the Hello World model. This section will focus on creating a scalable implementation and the moving pieces in the neural network.

This version will not be able to predict HTML from random websites, but it’s still a great setup to explore the dynamics of the problem.


If we expand the components of the previous graphic it looks like this.

There are two major sections. First, the encoder. This is where we create image features and previous markup features. Features are the building blocks that the network creates to connect the design mockups with the markup. At the end of the encoder, we glue the image features to each word in the previous markup.

The decoder then takes the combined design and markup feature and creates a next tag feature. This feature is run through a fully connected neural network to predict the next tag.
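The last step of the decoder, turning a next tag feature into a prediction, can be sketched with NumPy. The vocabulary, the feature size of 6, and the random weights are all assumptions for illustration; a trained model would have learned weights.

```python
import numpy as np


def softmax(x):
    """Turn raw scores into probabilities that sum to 1."""
    exps = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return exps / exps.sum()


# Hypothetical vocabulary and feature size
vocabulary = ["start", "<div>", "</div>", "end"]
rng = np.random.default_rng(0)
next_tag_feature = rng.normal(size=6)

# One fully connected layer: a weight matrix and a bias vector
weights = rng.normal(size=(6, len(vocabulary)))
bias = np.zeros(len(vocabulary))

probabilities = softmax(next_tag_feature @ weights + bias)
predicted_tag = vocabulary[int(np.argmax(probabilities))]
print(probabilities, predicted_tag)
```

The tag with the highest probability is taken as the prediction, which mirrors the softmax output layer used in the hello world model.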

Design mockup features

Since we need to insert one screenshot for each word, this becomes a bottleneck when training the network (example). Instead of using the images, we extract the information we need to generate the markup.

The information is encoded into image features. This is done by using an already pre-trained convolutional neural network (CNN). The model is pre-trained on Imagenet.

We extract the features from the layer before the final classification.

We end up with 1536 eight-by-eight-pixel images known as features. Although they are hard for us to understand, a neural network can extract the objects and the position of the elements from these features.

Markup features

In the hello world version, we used a one-hot encoding to represent the markup. In this version, we’ll use a word embedding for the input and keep the one-hot encoding for the output.

The way we structure each sentence stays the same, but how we map each token is changed. One-hot encoding treats each word as an isolated unit. Instead, we convert each word in the input data to lists of digits. These represent the relationship between the markup tags.

The dimension of this word embedding is eight, but it often varies between 50 and 500 depending on the size of the vocabulary.

The eight digits for each word are weights similar to a vanilla neural network. They are tuned to map how the words relate to each other (Mikolov et al., 2013).
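A word embedding can be pictured as a lookup table with one row of eight weights per token. The sketch below uses random weights and a hypothetical vocabulary of five tokens; in a real model the rows are tuned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 8  # five tokens is an assumed toy size

# One row of eight trainable weights per token in the vocabulary
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A sentence expressed as token indices instead of one-hot vectors
token_ids = np.array([0, 1, 2])
embedded = embedding_matrix[token_ids]
print(embedded.shape)

# Looking up a row is equivalent to multiplying a one-hot vector by the matrix
one_hot = np.eye(vocab_size)[token_ids]
same_result = np.allclose(one_hot @ embedding_matrix, embedded)
print(same_result)
```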

This is how we start developing markup features. Features are what the neural network develops to link the input data with the output data. For now, don’t worry about what they are, we’ll dig deeper into this in the next section.

We’ll take the word embeddings and run them through an LSTM and return a sequence of markup features. These are run through a TimeDistributed dense layer. Think of it as a dense layer with multiple inputs and outputs.

In parallel, the image features are first flattened. Regardless of how the digits were structured, they are transformed into one large list of numbers. Then we apply a dense layer on this layer to form a high-level feature. These image features are then concatenated to the markup features.
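The shapes involved in this step can be traced with a small NumPy sketch. All sizes here are toy values chosen for readability (a 2x2x4 feature map instead of 8x8x1536), and the random weights stand in for a trained dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len = 3  # number of tokens in the padded sentence

# Toy stand-ins: a small image feature map and one markup feature per token
image_features = rng.normal(size=(2, 2, 4))
markup_features = rng.normal(size=(max_len, 6))

# 1) Flatten: whatever the layout, we get one long list of numbers
flat = image_features.reshape(-1)

# 2) Dense layer with ReLU forms a high-level image feature
weights = rng.normal(size=(flat.size, 5))
bias = np.zeros(5)
image_dense = np.maximum(0, flat @ weights + bias)

# 3) Repeat the image feature once per token, then glue it to each token
image_repeated = np.tile(image_dense, (max_len, 1))
combined = np.concatenate([image_repeated, markup_features], axis=-1)
print(flat.shape, image_repeated.shape, combined.shape)
```

Every row of the combined array holds the same image feature next to a different markup feature, which is what lets the decoder relate each token to the design.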

This can be hard to wrap your mind around — so let’s break it down.

Markup features

Here we run the word embeddings through the LSTM layer. In this graphic, all the sentences are padded to reach the maximum size of three tokens.

To mix signals and find higher-level patterns, we apply a TimeDistributed dense layer to the markup features. TimeDistributed dense is the same as a dense…
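One way to picture a time-distributed dense layer is as a single dense layer, with one shared set of weights, applied separately to every position in the sequence. The NumPy sketch below uses toy sizes and random weights as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, in_dim, out_dim = 3, 4, 2  # toy sizes

sequence = rng.normal(size=(timesteps, in_dim))  # one feature per token
weights = rng.normal(size=(in_dim, out_dim))     # shared across all tokens
bias = np.zeros(out_dim)


def dense(x):
    """A plain dense layer without an activation function."""
    return x @ weights + bias


# Apply the same dense layer to each timestep, one at a time
per_step = np.stack([dense(step) for step in sequence])

# Applying it to the whole sequence at once gives the same result
all_at_once = sequence @ weights + bias
matches = np.allclose(per_step, all_at_once)
print(per_step.shape, matches)
```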
