Patricia A. Banks, Professor of Sociology
Delaney Burkhard ’25
Alison Foong ’25
Isabella Le ’27
Madison Newsham ’25
![Figure 1: Print in the MHCAM collection that is annotated with the “person” label with 0.83 confidence by YOLO11; Henri Fantin-Latour, [woman leaning on podium], 1887](https://artmuseum.mtholyoke.edu/wp-content/uploads/2025/11/pbanks_blog_figure_1-194x300.jpg)
In Spring 2025, I offered a new course in the sociology department at Mount Holyoke College: Visual Sociology. The course examines how sociologists engage with visual material in the study of society. As part of their coursework, students completed a project focusing on object detection in visual culture.
Determining whether art depicts entities such as people and animals is a need shared by sociologists studying visual culture and museum practitioners alike. For cultural sociologists, identifying people in visual culture is a crucial first step in developing theories around topics such as identity and art patronage. For example, in my book Represent: Art and Identity Among the Black Upper-Middle Class, I examine how identity is associated with the types of figurative art that people display in their homes. Similarly, identifying animals in visual culture is a necessary step in sociological research investigating topics such as the representation of animals in the media. For museums, identifying people and animals in art is often part of tagging, which is used to make collections more accessible.
Traditionally, social scientists and museum professionals have identified people and animals in visual culture by hand coding. Hand coding can be time-consuming since often hundreds or thousands of images need to be labeled. One approach that can potentially address some of the limitations of hand coding is object detection. Object detection is a computer vision technique that uses machine learning algorithms to identify and locate specific objects within digital images and video frames. Increasingly, social scientists and museums are using object detection to describe and analyze visual culture.
![Nakayama Sugakudo, [No. 23: Mashigi, futoi, hirugao [Snipe, Club-Rush, and Morning Glories], from the series Sho utsushi yonju-hachi taka [Forty-eight Hawks Drawn from Life], 1859, annotated](https://artmuseum.mtholyoke.edu/wp-content/uploads/2025/11/pbanks_blog_figure_2-203x300.jpg)
As object detection gains traction in these fields, it’s critical to better understand both the conditions under which it accurately identifies people and animals and the costs and benefits of implementation. The “Visual Recognition in Art Project: Traditional Approaches vs. AI,” which students worked on in Visual Sociology, aims to help address these knowledge gaps. Our research examines how hand labeling by visual sociology students compares to object detection and museum tags in identifying people and animals in 19th-century prints from the Mount Holyoke College Art Museum (MHCAM).
The object detection model we used is YOLO11. YOLO11 can recognize different types of objects, including “person” and various animals such as “cat,” “dog,” and “horse.” The model produces a confidence score with each detection, indicating how certain it is about what it found (Figure 1 and Figure 2). We’ve been running YOLO11 using a Python script in JupyterHub, a cloud-based interactive computing environment.
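For readers curious about what such a script might look like, below is a minimal sketch of the post-processing step: counting person and animal detections above a confidence threshold. It is an illustration rather than the project’s actual code. It assumes detections have already been collected as (label, confidence) pairs; the call that would produce them via the ultralytics package is shown in a comment, and the example detections are made up.

```python
# Sketch of post-processing YOLO11 detections for one print.
# With the ultralytics package installed, the detections would come
# from a call roughly like:
#   from ultralytics import YOLO
#   model = YOLO("yolo11n.pt")
#   results = model("print.jpg")
#   detections = [(model.names[int(b.cls)], float(b.conf))
#                 for b in results[0].boxes]

# The animal classes YOLO11's default (COCO-trained) weights can label
ANIMAL_LABELS = {"bird", "cat", "dog", "horse", "sheep", "cow",
                 "elephant", "bear", "zebra", "giraffe"}

def summarize(detections, min_conf=0.5):
    """Count person and animal detections at or above a confidence threshold."""
    people = sum(1 for label, conf in detections
                 if label == "person" and conf >= min_conf)
    animals = sum(1 for label, conf in detections
                  if label in ANIMAL_LABELS and conf >= min_conf)
    return {"people": people, "animals": animals}

# Made-up detections for a single print; the 0.41 "person" falls
# below the threshold and is excluded
example = [("person", 0.83), ("giraffe", 0.72), ("person", 0.41)]
print(summarize(example))  # {'people': 1, 'animals': 1}
```

The confidence threshold is the key tuning decision: lowering it surfaces more tentative detections for a human coder to review, while raising it keeps only the labels the model is most certain about.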
Overall, we’ve found that while YOLO11 offers speed and scalability, challenges remain regarding factors such as reliability and the need for technological expertise. These emerging findings have important implications for visual sociologists and museum practitioners as they consider integrating object detection into their work. Below, students in the Visual Sociology course offer reflections on their experiences and insights from the project.
Delaney Burkhard ’25
Below, you will find my hand coded annotation and YOLO11’s detections for the print Le Vice Supreme (Figure 3 and Figure 4). When I hand coded this image, I found myself going back and forth on whether to label some objects as persons or animals. The two foregrounded figures are skeletons: one holding a decapitated skull and another stepping out of a coffin, holding a fan. The skeletons stand on top of what appears to be a stone featuring a skeletal wolf. I found myself at an impasse – my labels didn’t specify whether the people or animals were meant to be alive. Do skeletons count as people? What about inscriptions of skeletal wolves on stones – do those count as animals? I hadn’t anticipated the issue of human-like beings and opted to skip this print and return to it the next day. After some reflection, I decided that my benchmark for human-like beings would be whether I thought the artist intended for the object to be read as a person. In the end, I identified seven animals (the crows seen on either side of the two figures) and no people. I decided that anthropomorphic skeletons would not count as people, nor would a skeletal wolf pictured on a stone count as an animal.
When I ran the same image through YOLO11, the model’s machine vision offered no clarity on the issue of human- or animal-like beings, instead assigning the sole label “giraffe” (i.e., animal) at 0.72 confidence to the decapitated skeleton. Moreover, the MHCAM did not have tags for this work. The museum’s collection database offers a description clarifying that the print depicts two skeletons, black birds, and a base with the “she-wolf and skeletal Romulus and Remus pictured on it,” a detail I had failed to notice in all my time analyzing this image. This raises several further questions: What is the role of domain knowledge (in this context, art historical knowledge) in object detection? Is there a way we can integrate it into AI software?
Even when I disagreed with YOLO11’s labels, as my case study of Le Vice Supreme demonstrates, the model made me think more critically about the decisions I was making in the hand coding process. Throughout my data collection and analysis, YOLO11’s annotations repeatedly drew me toward objects I had overlooked, compelled me to rethink the reasoning behind my coding decisions, and in many ways, made me spend more time with the art. This experience suggests that thinking of AI as a collaborator is the most robust approach to analyzing art, especially in the museum context.
Alison Foong ’25
My findings show that hand coding resulted in the most people detected, followed by YOLO11 and then museum tags. One confounding factor may be that YOLO11’s “person” tag is broader than what the museum usually tags for. For example, a museum might tag for “woman” or “man,” indicating a gendered person rather than simply a person.
Figures 5 and 6 suggest that the range of tags may not be the only factor contributing to differences. In Figure 6, I labeled more objects as human than YOLO11 did. As I was hand coding, I noticed a stylistic pattern of depicting humans as small circles attached to slightly oblong shapes. I also noticed the depiction of a path and concluded that it was reasonable for humans to be on it. YOLO11’s lower number of person tags may be due to the more abstract nature of the upper third of the print or a failure to recognize the smaller details and line weights. Nevertheless, the difference between my labels and YOLO11’s encouraged me to take a more critical second look at the print. In my view, YOLO11 can provide an entry point for questioning one’s perspective on art but should not be the sole authority in analysis.
Madison Newsham ’25

In terms of labeling animals, YOLO11 tended to provide accurate labels about 50% of the time. However, there were far fewer images that included animals than people, so there is less data overall. There is no clear pattern for determining the inaccuracies of YOLO11 detection for animals. In some cases, YOLO11 had no trouble identifying an animal in the image, and in others, it would mislabel or miss the figure. There were a few images in which the figures overlapped, which could have affected the accuracy of labeling.
Of note, however, is that in some cases, while both the hand coding and YOLO11 provided matching ‘animal’ labels, the specific animal detected by YOLO11 was incorrect. In one image, three wolves were labeled as ‘birds’; in another, a monkey was labeled as ‘teddy bear’ (Figure 8); in another, a sheep was labeled as ‘horse’; and, lastly, in one image a chicken was labeled as ‘horse.’ None of these images were abstract and all had realistic depictions of the animals. However, in the case of the wolves, the figures were overlapping and clustered, and in the case of the monkey, it was adorned with a hat.

Isabella Le ’27
My findings show that hand coding remains the most accurate method, detecting people in 72% of the prints and animals in 13%. However, from first-hand experience, it was very time consuming to annotate all 100 images. YOLO11 offers moderate accuracy and quickly identified people in 46% of the prints and animals in 13%. The agreement between hand coding and YOLO11 was substantial for person detection, overlapping in 44% of the prints. This demonstrates that AI and human labeling can complement each other. However, the limited overlap in animal detection (only four prints) suggests that both methods have blind spots and miss animals the other detects. This divergence suggests a need to combine approaches to maximize accuracy and perhaps result in less time consumption for humans.
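Rates like those above come from comparing, print by print, whether each method found a person. The sketch below shows one straightforward way to compute such per-method detection rates and the hand/YOLO overlap; the print IDs and True/False values here are hypothetical, not the project’s actual data.

```python
def rates(hand, yolo):
    """Per-print person-detection rates and hand/YOLO agreement.

    hand, yolo: dicts mapping print ID -> True if that method
    detected a person in the print. Returns percentages over all
    prints: (hand rate, YOLO rate, overlap rate).
    """
    ids = hand.keys()
    n = len(ids)
    hand_rate = 100 * sum(hand[i] for i in ids) / n
    yolo_rate = 100 * sum(yolo[i] for i in ids) / n
    # Overlap: prints where BOTH methods detected a person
    overlap = 100 * sum(hand[i] and yolo[i] for i in ids) / n
    return hand_rate, yolo_rate, overlap

# Hypothetical results for five prints
hand = {"p1": True, "p2": True, "p3": False, "p4": True, "p5": False}
yolo = {"p1": True, "p2": False, "p3": False, "p4": True, "p5": True}
print(rates(hand, yolo))  # (60.0, 60.0, 40.0)
```

Note that in this example each method detects people in 60% of the prints but they agree on only 40%, which mirrors the pattern in the findings: similar headline rates can still hide different blind spots.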
Another discovery was that I had miscounted the animals in the Information (Meissonier) print: I identified only one animal (Figure 7), while YOLO11 identified four. There are actually two, as a black horse is slightly hidden behind the white horse. I didn’t notice it until I went back to check YOLO11’s annotations and realized we were both wrong. Even so, the comparison helped me correctly identify the number of animals in the image.

![Le Vice Supreme [Supreme Vice], Felicien Rops, 1884, annotated by Delaney Burkhard](https://artmuseum.mtholyoke.edu/wp-content/uploads/2025/11/pbanks_blog_figure_3-201x300.jpeg)
![Le Vice Supreme [Supreme Vice], Felicien Rops, 1884, annotated by YOLO 11](https://artmuseum.mtholyoke.edu/wp-content/uploads/2025/11/pbanks_blog_figure_4-201x300.jpeg)
![Gotenyama, Shinagawa, from the series Meisho Edo hyakkei [One hundred famous views of Edo], Utagawa Hiroshige I, annotated by YOLO11](https://artmuseum.mtholyoke.edu/wp-content/uploads/2025/11/pbanks_blog_figure_5-205x300.jpeg)
![Gotenyama, Shinagawa, from the series Meisho Edo hyakkei [One hundred famous views of Edo], Utagawa Hiroshige I, annotated by Alison Foong](https://artmuseum.mtholyoke.edu/wp-content/uploads/2025/11/pbanks_blog_figure_6-205x300.jpeg)