Mining Data to Build Digital Twins

Mining the Dark Data of CAD to Build Digital Twins

How an engineering software development firm tackled one of the key challenges facing CAD-to-GIS conversion initiatives.

CATEGORY: enterprise engineering

This article, written by author Gavin Schrock, was originally published in GoGeomatics
November 10, 2023

In theory, harvesting features from engineering drawings to populate and augment a GIS should be a fairly straightforward proposition. In reality, it is seldom easy to execute. Surely the linework and symbols in CAD drawings, distinguished by levels (or layers) and cells (or blocks) and presented in a relatable spatial reference framework, should translate seamlessly to a GIS schema? It is often said that the devil is in the details, but in the case of legacy CAD, the devil can be the inability to recognize the details.

“The thing about CAD standards is that they are not always strictly adhered to, plus they evolve,” said Mark Stefanchuk, Chief Technology Officer at Phocaz Inc. “For instance, pre-2000, levels were numbered, and later this evolved to level names. The designers have a tendency, especially when they’re rushed, not to use CAD standards. We’ve run into situations where they will drop cells to their base elements, like lines, circles, box text, etc., or they’ll group them together and we lose the attribution that we would prefer to keep.”

There can be disconnects with contracted design firms, funded projects that require different standards, and evolving conventional wisdom on level/cell naming schemas. For instance, some engineering entities have expanded their standards to distinguish between purely as-designed and as-built (or record drawing) features. “So that if we need to, again, evolve CAD standards, we have to develop a means of going from what existed years ago, to what exists today, to what we want in the future,” said Stefanchuk. “That happens in pretty much any organization—some have better controls on their standards than others. Certainly, within for instance, department of transportation clients, we do see anomalies, from project to project, people disintegrating or exploding cells.”
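To make the idea of "going from what existed years ago to what exists today" concrete, here is a minimal sketch of a level crosswalk in Python. It is an illustration only, not Phocaz's method: the level numbers, the old names, and the current names are all hypothetical and do not come from any real CAD standard.

```python
# Hypothetical crosswalk tables: these numbers and names are made up for
# illustration and are not taken from any agency's CAD standard.
LEGACY_NUMBER_TO_NAME = {12: "PAVT_MARKING", 31: "DRAINAGE_CULVERT"}
RENAMED_LEVELS = {
    "PAVT_MARKING": "RD_PavementMarking",
    "DRAINAGE_CULVERT": "DR_Culvert",
}


def normalize_level(level):
    """Map a legacy level (number or old name) to the current level name.

    Returns None when the level cannot be resolved, which flags the element
    for some other kind of inspection instead of guessing.
    """
    if isinstance(level, int):
        # Pre-2000 style: numbered levels are first translated to old names.
        level = LEGACY_NUMBER_TO_NAME.get(level)
    if level is None:
        return None
    if level in RENAMED_LEVELS.values():
        return level  # already a current name
    return RENAMED_LEVELS.get(level)
```

Anything that comes back as `None` (an unknown level, or an exploded cell placed on the wrong level) is the kind of element a lookup alone cannot classify.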

A key element of their AI-driven process is a virtual agent, or “robot car” (shown in red at lower left), that “drives” the CAD file lanes to detect geometries that would otherwise be missed in a simple extraction of levels/layers and cells/blocks: for instance, a turn lane symbol cell that the drafter/designer had exploded or put on the wrong level.

Dark Data

“One of the things that often comes up during discussions about CAD-GIS migration is why would you want to work with legacy data?” said Mary A. Ramsey, Founding Partner and CEO at Phocaz Inc.

“Our thought, and that of our clients, is that anybody who has legacy data had invested substantially, even millions of dollars, in getting that data in the first place.” It can be a huge investment, made over decades of creating CAD records; it makes sense to at least try to glean as much from those records as possible.

“In the case of departments of transportation (DOTs), that’s taxpayer money that went into it. A substantial investment has been made in these digital assets, now you need to reap more rewards from it. So absolutely, start analyzing the data that’s in there. What can’t be extracted automatically by say, level and cell name, that is ‘dark data’. Yet it is the drawing, as geometry—imagine the value that is there if it could be extracted fully.” Dark data in infrastructure is a hot topic. In this context, dark data refers to otherwise valuable data that is not readily accessible because of evolving data conventions, lack of adherence to standards, overreliance on institutional knowledge, or poor data management. AI is enabling new ways to mine it, and that is exactly the approach Phocaz took to mine dark data from the CAD archives of transportation sector clients.

Another question is “why not mobile mapping and drones?” Mobile mapping and drone-based data capture have evolved splendidly in recent years, with advances in precision, positional accuracy, automated feature recognition, and simplified field operations. But the reality is that capturing tens of thousands of miles of highway corridors would cost many millions, and even then, these technologies could not capture every feature. The sunk costs of decades of CAD design and record drawings stand out as a potentially cost-effective resource to mine to substantially populate and augment an enterprise GIS (which is itself evolving into a digital twin).

Productivity Enhancement

“We started Phocaz as a means to provide software development services primarily for the computer aided design space,” said Ramsey. “Specifically for civil engineering users; Civil 3D, Bentley MicroStation and OpenRoads (of course, at the time, it would have been InRoads), etc. We were essentially developing and maintaining add-ins for our clients that run on those platforms and foundational products. We continue to do that type of work today, a lot of work for DOTs along those lines, and other infrastructure clients as well.”

Phocaz began work on their CAD-to-GIS solution when a long-time client, the Georgia Department of Transportation (GDOT), came to them with this request: Is it possible to collect data from a CAD file, automate the extraction, and put it into our enterprise GIS? This was the genesis of what GDOT dubbed CLIP, for “CAD Level Integration Process”. Phocaz first identified existing tools, some even within the CAD environment, that were designed to do this. “We discovered pretty quickly that these processes were a little bit slow and would not have been practical considering the enormity of the CAD archive the DOT wished to mine. Not to mention how to manage production and the huge amount of data such an undertaking would produce.”

The solution needed to be scalable to meet the needs of GDOT and other large infrastructure clients. “For instance, GDOT manages 80,000 centerline lane miles and federal aid routes,” said Stefanchuk. “That represents about a third of the highway roads in the state—it’s probably closer to 250,000 miles of lanes in the state.” Georgia is not alone in the opportunity (and challenge) of “mining” so many miles of highway CAD files; take a look at the lane mile totals for each of the 50 states. Phocaz began developing AI-powered algorithms, adopted a digital twin approach, and tapped Bentley Systems’ ProjectWise for production and data management.

The Virtual Robot Car

At first, the thought was to simply scan the CAD drawings for the low-hanging fruit of features recognizable by level and cell names. But it worked out better to “drive the digital lanes” once and comprehensively extract features.

The concept was to have the AI examine the drawing by progressing along lanes and capturing features as it goes, almost like driving each lane with a vehicle equipped for LiDAR/imaging mobile mapping (but at a fraction of the cost). But before the AI car can begin its journeys, a consistent spatial environment needs to exist. Fortunately, as Stefanchuk notes, the design approach in CAD has been to work in a model, pull in the references, and cut sheets from that. So, in almost all cases, the drawing is ready to “drive”. For their DOT clients that work in a DGN (MicroStation) environment, Bentley ProjectWise proved to be especially well suited to managing the drawings, extraction progress, and resultant data.


ProjectWise is a project management suite from Bentley Systems that can serve as a hub for data across multiple disciplines and formats and across the entire project lifecycle, and that enables work in a digital twin environment. As many of Phocaz’s transportation sector clients work primarily in a Bentley environment (e.g., DGN and MicroStation, and related design software packages), it made sense to manage CLIP projects in this suite.

“The CLIP car, or robot car, as we called it, is really a visualized session tool for us, to understand what is happening with our algorithms,” said Stefanchuk. “The end user is never going to see that.” Though I have to say, it was fun to see the robot car depicted in a demonstration. “What they want, ultimately, is the centerline graphics of the features within the GIS environment and the properties assigned to those.”

In order to find out what those properties are at any given point along the highway, Phocaz developed a tool that can look and find those features. The AI gets trained on the various spatial aspects of features, like bike lane markings (that can vary a lot even from county to county), and applies other rules, like how far abreast to look to cover standard right-of-way widths. “We had to conceptualize and visualize what we wanted the algorithms to do,” said Stefanchuk. “We thought about a winding road that runs through the countryside, how we would drive it in the physical world, and what we could see out the front and side windows. Then, how to teach the AI to “travel” down the channelized CAD lanes and learn from what it would likely see.”
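The "drive and look abreast" idea can be sketched geometrically. The Python below is a simplified illustration under stated assumptions, not the CLIP algorithm: the function name `drive`, the point-feature representation, and the `half_width` parameter are all hypothetical. It walks a centerline polyline segment by segment and reports every point feature that lies within a fixed lateral distance, along with its station (distance travelled) and its signed offset.

```python
import math


def drive(centerline, features, half_width):
    """Walk the centerline segment by segment and report nearby features.

    centerline: list of (x, y) vertices in travel order.
    features:   list of (name, x, y) points (a stand-in for CAD elements).
    half_width: how far to look to either side, in drawing units.

    Returns (station, offset, name) tuples sorted by station. The offset is
    positive to the left of the direction of travel.
    """
    seen = []
    station_at_start = 0.0
    for (x0, y0), (x1, y1) in zip(centerline, centerline[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # skip duplicate vertices
        for name, px, py in features:
            # Dot product gives distance along the segment; cross product
            # gives the signed lateral offset from it.
            along = ((px - x0) * dx + (py - y0) * dy) / length
            offset = (dx * (py - y0) - dy * (px - x0)) / length
            if 0.0 <= along < length and abs(offset) <= half_width:
                seen.append((station_at_start + along, offset, name))
        station_at_start += length
    return sorted(seen)


# A road that heads east for 100 units, then turns north.
road = [(0, 0), (100, 0), (100, 100)]
points = [("left_arrow", 50, 2), ("sign", 50, -30), ("culvert", 95, 40)]
nearby = drive(road, points, half_width=10)
```

This sketch is deliberately naive at bends, where a feature can project onto two segments or onto none, and it treats every feature as a single point. A production tool would need to handle both cases.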

“There’s a couple of places where the CLIP/robot car is actually an advantage,” said Stefanchuk. “One of them is we don’t have to collect all of the data at once, we only need to collect what we see at that instance, make decisions on it, pack it away until we’re ready to report on it, and then continue to move down the highway. When we come upon something like a pavement marking, we can use some visual AI models in order to figure out what that pavement marking represents.”

“What we learned from the CLIP project was that we can start with a symbol, like a right turn arrow or a left turn arrow and can teach an AI to detect that,” said Stefanchuk. “But we can make other decisions based on what we can infer from that, like what kind of lane I’m driving on. Are we driving on a right-turn lane, through lane, a left-turn lane, a U-turn lane, and so on.”
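The inference step Stefanchuk describes, going from a detected symbol to a lane type, can be illustrated with a simple lookup. This is a hedged sketch only: the symbol labels, the lane type strings, and the majority-vote rule are assumptions made for illustration, not details from the CLIP project.

```python
from collections import Counter

# Hypothetical labels a symbol detector might emit, mapped to lane types.
SYMBOL_TO_LANE_TYPE = {
    "left_turn_arrow": "left-turn lane",
    "right_turn_arrow": "right-turn lane",
    "u_turn_arrow": "U-turn lane",
    "bike_symbol": "bike lane",
}


def infer_lane_type(detected_symbols, default="through lane"):
    """Return the lane type implied by the most frequent known symbol.

    Unknown labels are ignored. If no known symbol was seen along the lane,
    the lane is assumed to be a through lane.
    """
    counts = Counter(s for s in detected_symbols if s in SYMBOL_TO_LANE_TYPE)
    if not counts:
        return default
    symbol, _ = counts.most_common(1)[0]
    return SYMBOL_TO_LANE_TYPE[symbol]
```

For example, a lane on which the detector reports two left-turn arrows and one stray right-turn arrow would be classified as a left-turn lane.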

Phocaz did not focus on just pavement markings. Using the same approach as they did for pavement markings, they can create a machine learning model for any cell that’s in any cell library. “Our AI brain is a machine learning model (MLM),” said Stefanchuk. “Our software, a separate application from CLIP (called Phorz AI), will guide a user through creating their own MLM starting with one or more cells (symbols) like a turn arrow, bike lane, driveway, culvert, etc. The MLM a user creates can then be applied to detect these objects in any iTwin (digital twin) model. The idea was that we could make it easy enough for anyone to create an MLM that could detect features in a CAD project. In the case of GDOT’s CLIP, that has an MLM that we trained so they don’t need to perform this step.” For other clients’ projects, a master model is created, but Phocaz leaves the door open for any user to augment and teach the AI, as cells and symbols can vary from city to city, county to county, etc.

Phocaz was recently honored as a finalist in the Enterprise Engineering category of the annual Year in Infrastructure Going Digital Awards, held in Singapore, October 11-12, 2023. At the same event, Julien Moutte, Bentley Systems Chief Technology Officer, demonstrated CLIP extracting left turn arrows from CAD highway drawings in his keynote address. “GDOT always believed its CAD drawings could be a rich source of asset data,” said Moutte. “But accessing that data required manually collecting designs and drawings—thousands of them, and then visually inspect each asset, which would take countless hours. To light up the dark data, Phocaz used ProjectWise powered by iTwin to create digital twins that can be more efficiently analyzed using AI with feature detection and spatial referencing. Phocaz went even further, using a novel AI technique to fill in the gaps between the models. They created an AI agent that can virtually drive along the lanes in the digital twin detecting the center lines. With AI automation, the process of extracting the data is no longer time or cost prohibitive for their clients.”

Future Applications

“CLIP is a unique workflow that we’ve developed to solve this problem for our transportation clients,” said Ramsey. “We are able to start with the context of the prerequisite that we’re working with—roads. So, we wouldn’t necessarily be able to readily apply it to say, architecture. However, once we understand what that context is, we can start thinking about how we collect data from those kinds of designs.”

What other infrastructure applications could this approach be adapted to? Utilities come immediately to mind. There are transmission and distribution networks, and in the case of telephony and communications networks, there are rules-based connectivity elements that could help further refine the analysis of both linear features and types of appurtenances. When it comes to underground utilities, considering the impracticality of physically locating all features, automation in the extraction of CAD features could be invaluable. It is not out of the question that this kind of solution could have some success in extracting features from scanned engineering drawings. However, there are spatial reference challenges (scale and positional registration), plus the quality and completeness of raster-to-vector conversions (although there has been a lot of progress using AI to enhance these as well).

As municipalities, utilities, and campuses seek to build digital twins, the cost of full physical data capture and as-built surveys stands as a roadblock to wider adoption. However, few features were constructed without some kind of design drawings, and at least for those built within the past four decades, there are likely CAD drawings that could be mined in this manner. There is a huge trove of dark data lurking in the millions of CAD files out there. Time to make better use of it.