The Big Data of Big Hair, Hacker News

Methods

Dataset

This story uses a public dataset of American high school yearbook photos, running through 2013, created and published by Shiry Ginosar, Kate Rakelly, Sarah Sachs, Brian Yin, and Alexei A. Efros. All faces are front-facing and aligned by eye position. Complete details about the creation of this dataset can be found in an article by the authors.

Image processing pipeline

In order to analyze hair in the yearbook dataset, we created a processing pipeline with three stages:

Hair segmentation. Identify the pixels in a portrait that correspond to hair.
Feature detection. Summarize important features of the hairstyle.
Feature analysis. Investigate how hairstyles change over time using targeted analyses of the hairstyle features we’ve created.

Hair segmentation

Identifying the hair in a portrait is an example of semantic segmentation, a challenging problem in computer science. In our case, the computer’s task is to accurately identify whether each pixel in an image is a hair pixel or not. We leveraged an existing approach from deep learning, called a U-Net, which has shown promising results in biomedical image segmentation (e.g., identifying a tumor or lesion in a scan).

We adopted a popular U-Net architecture for this story (code here). To train the U-Net to segment hair, we required a labeled dataset containing images similar to the yearbook images. The closest existing dataset is Figaro1K, which consists of roughly 1,000 labeled images containing hair of many textures, colors, and styles. However, this dataset does not contain historical images, and it differs considerably from yearbook photos in image composition and possibly many other features, such as contrast, dynamic range, and sharpness. We expanded the training dataset by hand-segmenting over 400 pictures from a custom dataset of historical yearbook images from archive.org (because the segmentation model was trained before the author discovered the well-curated Ginosar et al. yearbook dataset used for this story! Kids, always try Googling a few different ways before making your own dataset). An augmented training set of Figaro1K images and the hand-labeled yearbook photos was used to train the U-Net initially. After training, test images that were successfully labeled by the U-Net were manually selected and added to the training set for another pass. This ultimately yielded a final model trained on over 3,000 images with segmentation masks.

The trained model was then used to generate hair maps for all images in the Ginosar et al. yearbook dataset.

Example images and hair maps:

There were several “failure modes” that proved challenging for the classifier.

Failure modes.

First, overexposed highlights. Second, low contrast between ear and hair. Third, low contrast between hair and background.

At this point, it was clear that, due to lighting and composition choices common to the oldest yearbook photos in the dataset, the U-Net especially struggled to segment hair in these images. Based on this challenge, we restricted the next stages of our analysis to more recent images, dropping the earliest decades. This left tens of thousands of images, or several thousand per decade on average.

Feature detection

In order to analyze changes in hairstyles, we needed a way to summarize hair features appropriately. After performing hair segmentation on each image in the dataset, we have a “hair map” that expresses the probability that each pixel contains hair. Next, we used a deep learning approach called a Variational Autoencoder (VAE) to summarize the hair maps in four coordinates. A VAE is a neural network that learns to express features that vary maximally over the entire dataset, such as hair length (which varies from Morticia-Addams-long to military-buzz-cut-short) and height (from buzz to beehive). Because the VAE is trained on hair maps and not the original yearbook photos themselves, the results are less influenced by unrelated features such as skin color, facial features, or image grain. A downside of this approach is that hair maps may lose textural and color detail, so the VAE is largely insensitive to trends in hair color or subtle texture changes. The script to train the VAE is hosted on GitHub.

Feature analysis

The complete set of scripts for the analyses conducted here is hosted on GitHub, along with the raw hair maps and the set of 4 features for each.

Representative looks for each decade: to estimate representative looks for each decade, we calculated the centroid of the cluster of images from that decade in 4-dimensional feature space. We selected the top 47 female- and male-tagged images closest to the centroid by Euclidean distance. From this subset, we manually curated images for display in the article based on image quality and to ensure a diverse representation of ethnicity.
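As a rough illustration of the centroid step (a minimal sketch, not the authors' script), the snippet below averages a set of 4-dimensional feature vectors and ranks images by Euclidean distance to that average. The image ids, feature values, and function names are all invented for the example.

```python
import math


def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]


def closest_to_centroid(features, k):
    """Return the ids of the k images nearest the centroid (Euclidean distance).

    `features` maps a hypothetical image id to its 4-D feature vector.
    """
    center = centroid(list(features.values()))
    ranked = sorted(
        features,
        key=lambda image_id: math.dist(features[image_id], center),
    )
    return ranked[:k]


# Toy data: three similar looks and one outlier (values are invented).
decade_features = {
    "img_a": [0.0, 0.0, 0.0, 0.0],
    "img_b": [1.0, 1.0, 1.0, 1.0],
    "img_c": [2.0, 2.0, 2.0, 2.0],
    "img_d": [9.0, 9.0, 9.0, 9.0],
}
representative = closest_to_centroid(decade_features, k=2)
print(representative)
```

In practice the same function would be run once per decade and per gender tag, and the shortlist it returns would still go through the manual curation described above.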

Mullet, beehive and straight hair analysis:

For each style, we trained a binary classifier to identify the target style (e.g., mullet or no mullet) on a labeled training set of yearbook photos. Note that yearbook photos, and not hair maps, are used so the classifier can leverage textural information. Each classifier has an identical architecture (see code for the mullet classifier here): features of each image in the training set are calculated using a hidden layer of a pretrained VGG convolutional neural network. A single fully connected feed-forward layer is trained to take those features as input and yield a correct classification.

After training on a small subset of labeled images from the dataset, the classifiers were used to identify target hairstyles in the rest of the dataset. Manual inspection was used to confirm the machine’s selections (note that category boundaries can be challenging and at times subjective: the difference between a beehive and a bouffant is sometimes narrow). Once incidences of each target look were confirmed, the proportion of each look per year was calculated.
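The final tally can be sketched as follows. This is a minimal example under assumed inputs: the (year, has_look) record layout, the function name, and the toy labels are hypothetical and are not values from the article.

```python
from collections import Counter


def proportion_per_year(records):
    """Fraction of portraits per year confirmed as the target look.

    `records` is a list of (year, has_look) pairs; the layout is hypothetical.
    """
    totals = Counter(year for year, _ in records)
    hits = Counter(year for year, has_look in records if has_look)
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Invented toy labels, not data from the article.
records = [
    (1985, True), (1985, False), (1985, False), (1985, False),
    (1986, True), (1986, True), (1986, False), (1986, False),
]
mullet_share = proportion_per_year(records)
print(mullet_share)
```

Dividing by the number of portraits in each year, rather than reporting raw counts, keeps years with more photos from looking like style peaks.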

Analyzing looks by gender expression: Every image in the yearbook dataset is tagged as “male” or “female” using a combination of automated labeling and human review. It is essential to note that there is imperfection in both the accuracy of the labeling and the aptness of a binary tag: we know this is a simplification that does not entirely represent the complexity of gender identity. Because we cannot be sure what gender every high schooler in the dataset would have identified as, we can only approximate the gender most likely expressed in an image. As such, we can only make coarse estimations about how gender identity and hair are intertwined from this dataset.

One estimate we do attempt to make is how much the hairstyles identified with male- vs. female-presenting portraits have diverged and overlapped over time. To do this, we use the 4-dimensional coordinates assigned to each hair map as features for distinguishing looks belonging to each class (male- and female-tagged images). For every year in the analysis, a random forest classifier was trained on the images from that year with 5-fold cross-validation. The results of these classifiers are plotted over time with Loess smoothing.
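The idea of using cross-validated accuracy as a per-year separability score can be sketched as below. To keep the example dependency-free, a simple nearest-centroid classifier stands in for the random forest used in the article, so this shows only the cross-validation logic, not the authors' model. All names and data are invented.

```python
import math


import random


def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_predict(train_x, train_y, x):
    """Predict the tag whose training centroid is closest to x."""
    centroids = {
        tag: mean_vector([row for row, t in zip(train_x, train_y) if t == tag])
        for tag in sorted(set(train_y))
    }
    return min(centroids, key=lambda tag: math.dist(centroids[tag], x))


def cross_validated_accuracy(xs, ys, folds=5, seed=0):
    """Mean held-out accuracy over `folds` shuffled folds."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    scores = []
    for f in range(folds):
        test_idx = order[f::folds]          # every folds-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct = sum(
            nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
            for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return sum(scores) / folds


# Toy "year" in which the two tags occupy distant regions of feature space.
tags = ["male"] * 10 + ["female"] * 10
separated_x = [[0.1 * i] * 4 for i in range(10)] + [[5 + 0.1 * i] * 4 for i in range(10)]
# Toy "year" in which both tags share exactly the same feature values.
overlapping_x = [[0.1 * i] * 4 for i in range(10)] * 2

separated_score = cross_validated_accuracy(separated_x, tags)
overlapping_score = cross_validated_accuracy(overlapping_x, tags)
print(separated_score, overlapping_score)
```

A score near 1 means the two tags are easy to tell apart from hair features alone in that year, while a score near chance means the looks overlap heavily.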

Hair size over time:

For each image in the dataset, we estimate the density of hair by summing the probability of hair in each pixel of its hair map (estimated by the U-Net) and normalizing by the number of pixels per image. For each year in the dataset, we calculated the median hair density over all hair maps. Our plot of hair size shows a Loess-smoothed curve fitted to these yearly medians.
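The density calculation can be sketched as below (before any smoothing). This is a minimal example, assuming each hair map is a nested list of per-pixel probabilities. The function names, the tiny 2x2 maps, and the years are invented for illustration.

```python
from statistics import median


def hair_density(hair_map):
    """Sum of per-pixel hair probabilities divided by the pixel count."""
    n_pixels = sum(len(row) for row in hair_map)
    return sum(sum(row) for row in hair_map) / n_pixels


def median_density_by_year(maps_by_year):
    """Median hair density over all hair maps in each year."""
    return {
        year: median(hair_density(m) for m in maps)
        for year, maps in sorted(maps_by_year.items())
    }


# Invented 2x2 "hair maps"; real maps would be full-size probability images.
maps_by_year = {
    1960: [
        [[1.0, 1.0], [0.0, 0.0]],   # density 0.50
        [[0.0, 0.0], [0.0, 1.0]],   # density 0.25
        [[1.0, 1.0], [1.0, 0.0]],   # density 0.75
    ],
    1961: [
        [[1.0, 0.0], [0.0, 0.0]],   # density 0.25
        [[1.0, 1.0], [1.0, 1.0]],   # density 1.00
    ],
}
yearly_median = median_density_by_year(maps_by_year)
print(yearly_median)
```

Using the median rather than the mean keeps a handful of badly segmented portraits from dominating a year's value.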
