Ok, like I said, figured out the issue, but not 100% certain how to fix it properly. All testing/debugging is done using RHEL-7.2 which contains parted-3.1. I'm just going to write down the debug process since I feel others should have to share in the suffering. ;) Jan and I both came to the conclusion that this is the troublesome line in parted: libparted/labels/fdasd.c -- fdasd_get_geometry anc->geo.cylinders = (n_sectors / (anc->geo.heads * anc->geo.sectors * (dev->sector_size / dev->phys_sector_size))); These are the values that are set at the time of the crash: n_sectors: 0 anc->geo.heads: 15 anc->geo.sectors: 12 dev->sector_size: 512 dev->phys_sector_size: 4096 So dev->sector_size / dev->phys_sector_size is zero, which causes the FPE. Thanks to Brian, we already know from this conditional: if (fstat (f, &st) == 0 && S_ISREG (st.st_mode)) { that parted shouldn't even be running the code which follows (which includes the aforementioned troublesome line), since "st" shouldn't be a regular file. So that means "f" -- the file descriptor passed in to fdasd_get_geometry() -- is for some reason pointing to a regular file. Jan confirmed this when he came up with a reproducer outside of the installation environment. I've already attached it, but for a simplified view, it's basically the following: dev = ped_device_get(argv[1]); disk = ped_disk_new(dev); read_f = open("/tmp/cc", O_RDONLY); printf("copy\n"); copy = ped_disk_duplicate(disk); Now the issue of this being a timing problem (as we originally suspected) was eliminated; this causes libparted to crash every single time. Of particular note is the call to open a random/trash file. I did some playing around and discovered that this configuration: read_f = open("/tmp/cc", O_RDONLY); dev = ped_device_get(argv[1]); disk = ped_disk_new(dev); printf("copy\n"); copy = ped_disk_duplicate(disk); did not reproduce the problem. Ok -- back to the original configuration, only let's change the file we're opening: dev = ped_device_get(argv[1]); disk = ped_disk_new(dev); read_f = open("/dev/dasda", O_RDONLY); printf("copy\n"); copy = ped_disk_duplicate(disk); Opening "/dev/dasda" (RO) did not produce the problem at all. So this verifies parted is passing around a bad file descriptor when we open a file between calls to ped_disk_new() and ped_disk_duplicate(). But now we also have a candiate for the bad function: ped_disk_duplicate() (libparted/disk.c). Well, ped_disk_duplicate() is a pretty abstracted piece of work, so it's pretty unlikely to be *the* function causing problems, but something it's calling sure is. So, next plan of attack: we know fdasd_get_geometry() is being passed the bad fd. So let's grep through parted to see what even calls fdasd_get_geometry() -- fortunately it's not a whole lot. We have dasd_read(), dasd_write(), dasd_probe(), and dasd_alloc_metadata() (all in libparted/labels/dasd.c). I forgot to mention that at this point, I'd compiled parted with -D DEBUG_DASD: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=10820819 This sets the PDEBUG macro(?) in parted, which is pretty useful for debugging DASD issues since it dumps a heck of a lot more information. I was able to eliminate dasd_read() and dasd_write() as candidates for the bad function, and via PDEBUG output and a few breakpoints in gdb, I eventually narrowed in on dasd_alloc_metadata(). (In retrospect, this seems totally obvious from looking at the backtrace in comment 12, but oh well.) So now we have our problem function: dasd_alloc_metadata(). There are about 50 lines of code from the start of the function and when fdasd_get_geometry() gets called. To summarize thus far, this is what is now known: * dasd_alloc_metadata() is our problem function. * fdasd_get_geometry is getting passed a bogus file descriptor from dasd_alloc_metadata, as such: fdasd_get_geometry(disk->dev, &anchor, arch_specific->fd). This "arch_specific->fd" is what we are concerned with; why is it even pointing to a bad fd? * To be clear, if "disk->dev" is our PedDevice (passed into dasd_alloc_metadata()), the suspect variable is then "disk->dev->arch_specific->fd". Ok, knowing this, I re-ran gdb and just set my breakpoint at dasd_alloc_metadata() this time. I counted from last time, and this function is called a total of four times before the fatal fifth call, when it leads to the crash. So I "continued" straight through the first four iterations and began inspecting things in that fateful fifth call. Breakpoint 1, dasd_alloc_metadata (disk=0x800038a0) at dasd.c:944 944 { (gdb) p disk $1 = (PedDisk *) 0x800038a0 (gdb) p disk->dev->arch_specific $2 = (void *) 0x800041a0 (gdb) p disk->dev->arch_specific->fd Attempt to dereference a generic pointer. (gdb) p disk->dev $3 = (PedDevice *) 0x80004100 (gdb) p disk->dev->arch_specific $4 = (void *) 0x800041a0 (gdb) p disk->dev->path $5 = 0x80004180 "/dev/dasda" (gdb) p disk->did There is no member named did. (gdb) p disk->read_only There is no member named read_only. (gdb) p disk->dev->read_only $6 = 0 (gdb) p disk->dev->did $7 = 0 (gdb) p disk->dev->boot_dirty $8 = 0 (gdb) p disk->dev->external_mode $9 = 0 (gdb) p disk->dev->dirty $10 = 0 (gdb) p disk->dev->open_count $11 = 0 (gdb) p disk->dev->read_only $12 = 0 It took me a long time and a couple rabbit-holes for this to stand out to me, but take a look at "disk->dev->open_count", which shows a value of 0. This should be non-zero, right? *At least* 1 since this "/dev/dasda" was opened when ped_disk_new() was called....right? (*Note: after digging through parted some more after this, I think it's expected for "dev->open_count" to be 0 here since that value is decremented when the device is closed, but at the time I expected it should be > 0....eventually it led me down the right path anyway though. ;) ) I don't remember my thought process at this point, but I consulted with a friend, who said it sounded as though parted was closing the fd it originally opened when ped_disk_new() was called (which is not weird or wrong), and when "arch_specific" has all its various attrs filled out in this line of dasd_alloc_metadata(): arch_specific = LINUX_SPECIFIC (disk->dev); the *same* fd was being used from parted's memory. So parted is relying on the same fd being valid even after closing it, which is a mistake. Remember the random open() call we injected between ped_disk_new() and ped_disk_duplicate() in the reproducer? It re-used the fd that parted closed after ped_disk_new() was finished running. More specifically -- and this is why we see parted running code that it absolutely should not be -- fdasd_get_geometry() is calling fstat (AKA trying to open) on this now-erroneous fd. What should really be happening here then? Well, parted has a function, ped_device_open(), which properly handles all of this, it turns out. If you read the comment for ped_device_open, you'll see why it's so important that this is called, and not regular ol' open -- it's arch-dependent and it may allocate resources. More importantly, let's look at the actual code: libparted/device.c -- ped_device_open int ped_device_open (PedDevice* dev) { int|status; PED_ASSERT (dev != NULL); PED_ASSERT (!dev->external_mode); if (dev->open_count) status = ped_architecture->dev_ops->refresh_open (dev); else status = ped_architecture->dev_ops->open (dev); if (status) dev->open_count++; return status; } Maybe it's that I'm unfamiliar with parted, but it takes abstraction to another level. It made things a bit tough to debug (for me), but what is important here is a safety/sanity check of "dev->open_count". If it doesn't exist/is zero, then call ped_architecture->dev_ops->open(dev). So what does that actually evaluate to? Took me a while to figure that out, but it calls _device_open(). Now this is a really important function, and shows us exactly why we need to be calling ped_device_open(), most notably these lines: libparted/arch/linux.c -- _device_open() retry: arch_specific->fd = open (dev->path, flags); if (arch_specific->fd == -1) { char* rw_error_msg = strerror (errno); arch_specific->fd = open (dev->path, RD_MODE); if (arch_specific->fd == -1) { if (ped_exception_throw ( PED_EXCEPTION_ERROR, PED_EXCEPTION_RETRY_CANCEL, _("Error opening %s: %s"), dev->path, strerror (errno)) != PED_EXCEPTION_RETRY) { return 0; } else { goto retry; The very first thing parted does is set "arch_specific->fd" -- properly! So if whatever the current fd value was before this point is now stale, it won't matter -- "dev->path" has always pointed to "/dev/dasda" here, that doesn't change. If ped_device_open() was called earlier, "dev->open_count" would've evaluated as 0, and _device_open() would've gotten called. It would've properly set "arch_specific->fd", and parted wouldn't have crashed. Jan tested a quick and dirty "fix" to verify that calling ped_device_open() (AKA properly resetting "arch_specific->fd") fixes this issue: diff --git a/libparted/labels/dasd.c b/libparted/labels/dasd.c index 4d533cf..4b15dbb 100644 --- a/libparted/labels/dasd.c +++ b/libparted/labels/dasd.c @@ -955,6 +955,8 @@ dasd_alloc_metadata (PedDisk* disk) PED_ASSERT (disk != NULL); PED_ASSERT (disk->dev != NULL); + ped_device_open(disk->dev); + arch_specific = LINUX_SPECIFIC (disk->dev); disk_specific = disk->disk_specific; @@ -1020,6 +1022,8 @@ dasd_alloc_metadata (PedDisk* disk) } } + ped_device_close(disk->dev); + ped_constraint_destroy (constraint_any); return 1; Ok, so it's nice we know what needs to happen to fix it. But how do we fix it? We obviously need to call ped_device_open() somewhere, as in the above test fix, to make sure all of the device's attributes are up-to-date and correct. The problem now is where do we call ped_device_open()? dasd_alloc_metadata() seems the obvious choice, until you realize other functions are very likely passing around the stale fd as well, e.g. dasd_read(), dasd_write(), and dasd_probe(). Then if you grep through parted for "arch_specific->fd" you'll see that the problem isn't localized to DASD code -- it's everywhere, all throughout libparted/arch/linux.c. All of these abstracted functions are blindly passing around "arch_specific->fd" without checking first that it's still valid. If you look in functions like _flush_cache() and linux_write() the implications of passing a bogus fd are kind of serious. If you look at functions in libparted/arch/linux.c like init_ide(), init_dasd(), etc. (any initialize function, basically), they do the right thing though: if (!_device_open_ro (dev)) goto error; This opens the device (RO) properly, so the fd is set. So these initialize functions are all ok, it's just the helper functions that reference "arch_specific->fd" where we need to be concerned. I'm not going to bother listing all of them, but I wanted to do one more thing: verify that these helper functions are indeed at risk for passing around a bogus fd. I added a few PED_ASSERT statements along the likes of what's in ped_device_open() and ped_device_close(): PED_ASSERT (dev != NULL); PED_ASSERT (!dev->external_mode); PED_ASSERT (dev->open_count > 0); to all of the helper functions in libparted/arch/linux.c which reference arch_specific->fd (though checking dev->external_mode wasn't necessary in all of them). Those RPMs with all of the PED_ASSERTs can be found in a scratch build here for all arches: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=10821597 I only tested this on s390x so far by launching anaconda. Anaconda is a perfect choice for this testing since it's constantly opening tons of files (logs, icons, devices, etc.). It also uses libparted via the pyparted package. So I injected this PED_ASSERT-laden parted to an updates.img and launched anaconda. Parted crashed very quickly on a PED_ASSERT: Starting installer, one moment... anaconda 21.48.22.56-1 for anaconda bluesky (pre-release) started. * installation log files are stored in /tmp during the installation * shell is available in second TMUX pane (ctrl+b, then press 2) * if the graphical installation interface fails to start, try again with the inst.text bootoption to start text installation * when reporting a bug add logs from /tmp as separate text/plain attachments Backtrace has 20 calls on stack: 20: /lib64/libparted.so.2(ped_assert+0x50) [0x3fff5b8cfc0] 19: /lib64/libparted.so.2(+0x1a180) [0x3fff5b97180] 18: /lib64/libparted.so.2(+0x1a274) [0x3fff5b97274] 17: /lib64/libparted.so.2(+0x1c1c6) [0x3fff5b991c6] 16: /lib64/libparted.so.2(ped_device_get+0xca) [0x3fff5b8d2aa] 15: /usr/lib64/python2.7/site-packages/_pedmodule.so(py_ped_device_get+0x56) [0x3fff5bf173e] 14: /lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x63c8) [0x3fffd6a0438] 13: /lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xa96) [0x3fffd6a1c4e] 12: /lib64/libpython2.7.so.1.0(+0x7ab04) [0x3fffd61db04] 11: /lib64/libpython2.7.so.1.0(PyObject_Call+0x5c) [0x3fffd5f3bc4] 10: /lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0xdde) [0x3fffd69ae4e] 9: /lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xa96) [0x3fffd6a1c4e] 8: /lib64/libpython2.7.so.1.0(+0x7ab04) [0x3fffd61db04] 7: /lib64/libpython2.7.so.1.0(PyObject_Call+0x5c) [0x3fffd5f3bc4] 6: /lib64/libpython2.7.so.1.0(+0x620a4) [0x3fffd6050a4] 5: /lib64/libpython2.7.so.1.0(PyObject_Call+0x5c) [0x3fffd5f3bc4] 4: /lib64/libpython2.7.so.1.0(+0xb4e0e) [0x3fffd657e0e] 3: /lib64/libpython2.7.so.1.0(+0xb3708) [0x3fffd656708] 2: /lib64/libpython2.7.so.1.0(PyObject_Call+0x5c) [0x3fffd5f3bc4] 1: /lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x114e) [0x3fffd69b1be] Aborted (core dumped) I am pretty certain these PED_ASSERTs would trigger a crash on x86_64 (and other arches) as well. Which brings us to the problem of fixing this. It's easy to assume calling ped_device_open() and ped_device_close() at the beginning and end of these helper functions will fix the issue, but I'm not sure that's actually the safest way to fix this. That would introduce bugs if any function we didn't find has always run in series and expects the device to be open. (Yes, this does mean those functions are buggy as well, since they assume the device is open, and they should probably be fixed.) Plus, since this is a library, we don't know everyone's order of operations; and in partition land, even an unexpected cache flush could subtly introduce bugs. One thing which remains a bit vague and unclear to me is "dev->open_count"; I feel like something isn't quite right with it, and I feel like it may (or could?) impact this bugfix. _device_close() uses unistd.h's close (which flat out invalidates the fd) then only does a dev->open_count--. So it sounds like "dev->open_count" could still be > 0 despite the fd being trashed. ped_device_close() is the only function which seems to do things right; it checks the status of "dev->open_count" and only closes the device if that will be 0. I think maybe _device_close() may also need to be patched to properly handle "open_count". I'm convinced this bug triggers on x86_64 and other arches as well, but it's probably in extremely subtle ways. The only reason it crashes here in 7.2 is because I added LDL DASD support in anaconda, which enabled a block of LDL-specific DASD code in parted to run....a block of code that calls fdasd_get_geometry(), which runs code it never should due to this bad fd and triggers a FPE. As for why/how this bug presents as we see it in 7.2 anaconda....well, anaconda more than likely opened a file right before dasd_alloc_metadata() was called for an LDL DASD. This file was assigned to the fd that once referenced the actual device. "Which file" doesn't even matter....it could've been something as simple as the icon we display on the dasdfmt dialog, or any log file.... Anyway, I really hope this makes sense....