Commit Graph

79815 Commits

Author SHA1 Message Date
Filipe Manana
73e339e6ab btrfs: cache sharedness of the last few data extents during fiemap
During fiemap we process all the file extent items of an inode, by their
file offset order (left to right b+tree order), and then check if the data
extent they point at is shared or not. Until now we didn't cache those
results, we only did it for b+tree nodes/leaves since for each unique
b+tree path we have access to hundreds of file extent items. However, it
is also common to repeat checking the sharedness of a particular data
extent in a very short time window, and the cases that lead to that are
the following:

1) COW writes.

   If have a file extent item like this:

                  [ bytenr X, offset = 0, num_bytes = 512K ]
   file offset    0                                        512K

   Then a 4K write into file offset 64K happens, we end up with the
   following file extent item layout:

                  [ bytenr X, offset = 0, num_bytes = 64K ]
   file offset    0                                       64K

                  [ bytenr Y, offset = 0, num_bytes = 4K ]
   file offset   64K                                     68K

                  [ bytenr X, offset = 68K, num_bytes = 444K ]
   file offset   68K                                         512K

   So during fiemap we well check for the sharedness of the data extent
   with bytenr X twice. Typically for COW writes and for at least
   moderately updated files, we end up with many file extent items that
   point to different sections of the same data extent.

2) Writing into a NOCOW file after a snapshot is taken.

   This happens if the target extent was created in a generation older
   than the generation where the last snapshot for the root (the tree the
   inode belongs to) was made.

   This leads to a scenario like the previous one.

3) Writing into sections of a preallocated extent.

   For example if a file has the following layout:

   [ bytenr X, offset = 0, num_bytes = 1M, type = prealloc ]
   0                                                       1M

   After doing a 4K write into file offset 0 and another 4K write into
   offset 512K, we get the following layout:

      [ bytenr X, offset = 0, num_bytes = 4K, type = regular ]
      0                                                      4K

      [ bytenr X, offset = 4K, num_bytes = 508K, type = prealloc ]
     4K                                                          512K

      [ bytenr X, offset = 512K, num_bytes = 4K, type = regular ]
   512K                                                         516K

      [ bytenr X, offset = 516K, num_bytes = 508K, type = prealloc ]
   516K                                                            1M

   So we end up with 4 consecutive file extent items pointing to the data
   extent at bytenr X.

4) Hole punching in the middle of an extent.

   For example if a file has the following file extent item:

   [ bytenr X, offset = 0, num_bytes = 8M ]
   0                                      8M

   And then hole is punched for the file range [4M, 6M[, we our file
   extent item split into two:

   [ bytenr X, offset = 0, num_bytes = 4M  ]
   0                                       4M

   [ 2M hole, implicit or explicit depending on NO_HOLES feature ]
   4M                                                            6M

   [ bytenr X, offset = 6M, num_bytes = 2M  ]
   6M                                       8M

   Again, we end up with two file extent items pointing to the same
   data extent.

5) When reflinking (clone and deduplication) within the same file.
   This is probably the least common case of all.

In cases 1, 2, 4 and 4, when we have multiple file extent items that point
to the same data extent, their distance is usually short, typically
separated by a few slots in a b+tree leaf (or across sibling leaves). For
case 5, the distance can vary a lot, but it's typically the less common
case.

This change caches the result of the sharedness checks for data extents,
but only for the last 8 extents that we notice that our inode refers to
with multiple file extent items. Whenever we want to check if a data
extent is shared, we lookup the cache which consists of doing a linear
scan of an 8 elements array, and if we find the data extent there, we
return the result and don't check the extent tree and delayed refs.

The array/cache is small so that doing the search has no noticeable
negative impact on the performance in case we don't have file extent items
within a distance of 8 slots that point to the same data extent.

Slots in the cache/array are overwritten in a simple round robin fashion,
as that approach fits very well.

Using this simple approach with only the last 8 data extents seen is
effective as usually when multiple file extents items point to the same
data extent, their distance is within 8 slots. It also uses very little
memory and the time to cache a result or lookup the cache is negligible.

The following test was run on non-debug kernel (Debian's default kernel
config) to measure the impact in the case of COW writes (first example
given above), where we run fiemap after overwriting 33% of the blocks of
a file:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi

   umount $DEV &> /dev/null
   mkfs.btrfs -f $DEV
   mount $DEV $MNT

   FILE_SIZE=$((1 * 1024 * 1024  * 1024))

   # Create the file full of 1M extents.
   xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar

   block_count=$((FILE_SIZE / 4096))
   # Overwrite about 33% of the file blocks.
   overwrite_count=$((block_count / 3))

   echo -e "\nOverwriting $overwrite_count 4K blocks (out of $block_count)..."
   RANDOM=123
   for ((i = 1; i <= $overwrite_count; i++)); do
       off=$(((RANDOM % block_count) * 4096))
       xfs_io -c "pwrite -S 0xcd $off 4K" $MNT/foobar > /dev/null
       echo -ne "\r$i blocks overwritten..."
   done
   echo -e "\n"

   # Unmount and mount to clear all cached metadata.
   umount $MNT
   mount $DEV $MNT

   start=$(date +%s%N)
   filefrag $MNT/foobar
   end=$(date +%s%N)
   dur=$(( (end - start) / 1000000 ))
   echo "fiemap took $dur milliseconds"

   umount $MNT

Result before applying this patch:

   fiemap took 128 milliseconds

Result after applying this patch:

   fiemap took 92 milliseconds   (-28.1%)

The test is somewhat limited in the sense the gains may be higher in
practice, because in the test the filesystem is small, so we have small
fs and extent trees, plus there's no concurrent access to the trees as
well, therefore no lock contention there.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:39 +01:00
Filipe Manana
56f5c19920 btrfs: remove useless logic when finding parent nodes
At find_parent_nodes(), at its last step, when iterating over all direct
references, we are checking if we have a share context and if we have
a reference with a different root from the one in the share context.
However that logic is pointless because of two reasons:

1) After the previous patch in the series (subject "btrfs: remove roots
   ulist when checking data extent sharedness"), the roots argument is
   always NULL when using a share check context (struct share_check), so
   this code is never triggered;

2) Even before that previous patch, we could not hit this code because
   if we had a reference with a root different from the one in our share
   context, then we would have exited earlier when doing either of the
   following:

      - Adding a second direct ref to the direct refs red black tree
        resulted in extent_is_shared() returning true when called from
        add_direct_ref() -> add_prelim_ref(), after processing delayed
        references or while processing references in the extent tree;

      - When adding a second reference to the indirect refs red black
        tree (same as above, extent_is_shared() returns true);

      - If we only have one indirect reference and no direct references,
        then when resolving it at resolve_indirect_refs() we immediately
        return that the target extent is shared, therefore never reaching
        that loop that iterates over all direct references at
        find_parent_nodes();

      - If we have 1 indirect reference and 1 direct reference, then we
        also exit early because extent_is_shared() ends up returning true
        when called through add_prelim_ref() (by add_direct_ref() or
        add_indirect_ref()) or add_delayed_refs(). Same applies as when
        having a combination of direct, indirect and indirect with missing
        key references.

   This logic had been obsoleted since commit 3ec4d3238a ("btrfs:
   allow backref search checks for shared extents"), which introduced the
   early exits in case an extent is shared.

So just remove that logic, and assert at find_parent_nodes() that when we
have a share context we don't have a roots ulist and that we haven't found
the extent to be directly shared after processing delayed references and
all references from the extent tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:39 +01:00
Filipe Manana
b629685803 btrfs: remove roots ulist when checking data extent sharedness
Currently btrfs_is_data_extent_shared() is passing a ulist for the roots
argument of find_parent_nodes(), however it does not use that ulist for
anything and for this context that list always ends up with at most one
element.

Since find_parent_nodes() is able to deal with a NULL ulist for its roots
argument, make btrfs_is_data_extent_shared() pass it NULL and avoid the
burden of allocating memory for the unnused roots ulist, initializing it,
releasing it and allocating one struct ulist_node for it during the call
to find_parent_nodes().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:39 +01:00
Filipe Manana
84a7949d40 btrfs: move ulists to data extent sharedness check context
When calling btrfs_is_data_extent_shared() we pass two ulists that were
allocated by the caller. This is because the single caller, fiemap, calls
btrfs_is_data_extent_shared() multiple times and the ulists can be reused,
instead of allocating new ones before each call and freeing them after
each call.

Now that we have a context structure/object that we pass to
btrfs_is_data_extent_shared(), we can move those ulists to it, and hide
their allocation and the context's allocation in a helper function, as
well as the freeing of the ulists and the context object. This allows to
reduce the number of parameters passed to btrfs_is_data_extent_shared(),
the need to pass the ulists from extent_fiemap() to fiemap_process_hole()
and having the caller deal with allocating and releasing the ulists.

Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since
that's a much better name as it reflects what the list is used for (and
matching the argument name for find_parent_nodes()).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:39 +01:00
Filipe Manana
61dbb952f0 btrfs: turn the backref sharedness check cache into a context object
Right now we are using a struct btrfs_backref_shared_cache to pass state
across multiple btrfs_is_data_extent_shared() calls. The structure's name
closely follows its current purpose, which is to cache previous checks
for the sharedness of metadata extents. However we will start using the
structure for more things other than caching sharedness checks, so rename
it to struct btrfs_backref_share_check_ctx.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:39 +01:00
Filipe Manana
ceb707da9a btrfs: directly pass the inode to btrfs_is_data_extent_shared()
Currently we pass a root and an inode number as arguments for
btrfs_is_data_extent_shared() and the inode number is always from an
inode that belongs to that root (it wouldn't make sense otherwise).
In every context that we call btrfs_is_data_extent_shared() (fiemap only),
we have an inode available, so directly pass the inode to the function
instead of a root and inode number. This reduces the number of parameters
and it makes the function's signature conform to most other functions we
have.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:39 +01:00
Filipe Manana
a0a5472ad8 btrfs: remove checks for a 0 inode number during backref walking
When doing backref walking to determine if an extent is shared, we are
testing if the inode number, stored in the 'inum' field of struct
share_check, is 0. However that can never be case, since the all instances
of the structure are created at btrfs_is_data_extent_shared(), which
always initializes it with the inode number from a fs tree (and the number
for any inode from any tree can never be 0). So remove the checks.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Filipe Manana
c902421927 btrfs: remove checks for a root with id 0 during backref walking
When doing backref walking to determine if an extent is shared, we are
testing the root_objectid of the given share_check struct is 0, but that
is an impossible case, since btrfs_is_data_extent_shared() always
initializes the root_objectid field with the id of the given root, and
no root can have an objectid of 0. So remove those checks.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Filipe Manana
206c1d32f3 btrfs: drop redundant bflags initialization when allocating extent buffer
When allocating an extent buffer, at __alloc_extent_buffer(), there's no
point in explicitly assigning zero to the bflags field of the new extent
buffer because we allocated it with kmem_cache_zalloc().

So just remove the redundant initialization, it saves one mov instruction
in the generated assembly code for x86_64 ("movq $0x0,0x10(%rax)").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Filipe Manana
b98c6cd59e btrfs: drop pointless memset when cloning extent buffer
At btrfs_clone_extent_buffer(), before allocating the pages array for the
new extent buffer we are calling memset() to zero out the pages array of
the extent buffer. This is pointless however, because the extent buffer
already has every element in its pages array pointing to NULL, as it was
allocated with kmem_cache_zalloc(). The memset() was introduced with
commit dd137dd1f2 ("btrfs: factor out allocating an array of pages"),
but even before that commit we already depended on the pages array being
initialized to NULL for the error paths that need to call
btrfs_release_extent_buffer().

So remove the memset(), it's useless and slightly increases the object
text size.

Before this change:

   $ size fs/btrfs/extent_io.o
      text	   data	    bss	    dec	    hex	filename
     70580	   5469	     40	  76089	  12939	fs/btrfs/extent_io.o

After this change:

   $ size fs/btrfs/extent_io.o
      text	   data	    bss	    dec	    hex	filename
     70564	   5469	     40	  76073	  12929	fs/btrfs/extent_io.o

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Filipe Manana
a2853ffc2e btrfs: skip unnecessary delalloc search during fiemap and lseek
During fiemap and lseek (hole and data seeking), there's no point in
iterating the inode's io tree to count delalloc bits if the inode's
delalloc bytes counter has a value of zero, as that counter is updated
whenever we set a range for delalloc or clear a range from delalloc.

So skip the counting and io tree iteration if the inode's delalloc bytes
counter has a value of zero. This helps save time when processing a file
range corresponding to a hole or prealloc (unwritten) extent.

This patch is part of a series comprised of the following patches:

  btrfs: get the next extent map during fiemap/lseek more efficiently
  btrfs: skip unnecessary extent map searches during fiemap and lseek
  btrfs: skip unnecessary delalloc search during fiemap and lseek

The following test was performed on a release kernel (Debian's default
kernel config) before and after applying those 3 patches.

   # Wrapper to call fiemap in extent count only mode.
   # (struct fiemap::fm_extent_count set to 0)
   $ cat fiemap.c
   #include <stdio.h>
   #include <unistd.h>
   #include <stdlib.h>
   #include <fcntl.h>
   #include <errno.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <linux/fs.h>
   #include <linux/fiemap.h>

   int main(int argc, char **argv)
   {
            struct fiemap fiemap = { 0 };
            int fd;

            if (argc != 2) {
                    printf("usage: %s <path>\n", argv[0]);
                    return 1;
            }
            fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    fprintf(stderr, "error opening file: %s\n",
                            strerror(errno));
                    return 1;
            }

            /* fiemap.fm_extent_count set to 0, to count extents only. */
            fiemap.fm_length = FIEMAP_MAX_OFFSET;
            if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
                    fprintf(stderr, "fiemap error: %s\n",
                            strerror(errno));
                    return 1;
            }
            close(fd);
            printf("fm_mapped_extents = %d\n", fiemap.fm_mapped_extents);

            return 0;
   }

   $ gcc -o fiemap fiemap.c

And the wrapper shell script that creates a file with many holes and runs
fiemap against it:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi

   mkfs.btrfs -f $DEV
   mount $DEV $MNT

   FILE_SIZE=$((1 * 1024 * 1024 * 1024))
   echo -n > $MNT/foobar
   for ((off = 0; off < $FILE_SIZE; off += 8192)); do
           xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null
   done

   # flush all delalloc
   sync

   start=$(date +%s%N)
   ./fiemap $MNT/foobar
   end=$(date +%s%N)
   dur=$(( (end - start) / 1000000 ))
   echo "fiemap took $dur milliseconds"

   umount $MNT

Result before applying patchset:

   fm_mapped_extents = 131072
   fiemap took 63 milliseconds

Result after applying patchset:

   fm_mapped_extents = 131072
   fiemap took 39 milliseconds   (-38.1%)

Running the same test for a 512M file instead of a 1G file, gave the
following results.

Result before applying patchset:

   fm_mapped_extents = 65536
   fiemap took 29 milliseconds

Result after applying patchset:

   fm_mapped_extents = 65536
   fiemap took 20 milliseconds    (-31.0%)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Filipe Manana
013f9c70d2 btrfs: skip unnecessary extent map searches during fiemap and lseek
If we have no outstanding extents it means we don't have any extent maps
corresponding to delalloc that is flushing, as when an ordered extent is
created we increment the number of outstanding extents to 1 and when we
remove the ordered extent we decrement them by 1. So skip extent map tree
searches if the number of outstanding ordered extents is 0, saving time as
the tree is not empty if we have previously made some reads or flushed
delalloc, as in those cases it can have a very large number of extent maps
for files with many extents.

This helps save time when processing a file range corresponding to a hole
or prealloc (unwritten) extent.

The next patch in the series has a performance test in its changelog and
its subject is:

    "btrfs: skip unnecessary delalloc search during fiemap and lseek"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Filipe Manana
d47704bd1c btrfs: get the next extent map during fiemap/lseek more efficiently
At find_delalloc_subrange(), when we need to get the next extent map, we
do a full search on the extent map tree (a red black tree). This is fine
but it's a lot more efficient to simply use rb_next(), which typically
requires iterating over less nodes of the tree and never needs to compare
the ranges of nodes with the one we are looking for.

So add a public helper to extent_map.{h,c} to get the extent map that
immediately follows another extent map, using rb_next(), and use that
helper at find_delalloc_subrange().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Qu Wenruo
88074c8b13 btrfs: raid56: make it more explicit that cache rbio should have all its data sectors uptodate
For Btrfs RAID56, we have a caching system for btrfs raid bios (rbio).

We call cache_rbio_pages() to mark a qualified rbio ready for cache.

The timing happens at:

- finish_rmw()

  At this timing, we have already read all necessary sectors, along with
  the rbio sectors, we have covered all data stripes.

- __raid_recover_end_io()

  At this timing, we have rebuild the rbio, thus all data sectors
  involved (either from stripe or bio list) are uptodate now.

Thus at the timing of cache_rbio_pages(), we should have all data
sectors uptodate.

This patch will make it explicit that all data sectors are uptodate at
cache_rbio_pages() timing, mostly to prepare for the incoming
verification at RMW time.

This patch will add:

- Extra ASSERT()s in cache_rbio_pages()
  This is to make sure all data sectors, which are not covered by bio,
  are already uptodate.

- Extra ASSERT()s in steal_rbio()
  Since only cached rbio can be stolen, thus every data sector should
  already be uptodate in the source rbio.

- Update __raid_recover_end_io() to update recovered sector->uptodate
  Previously __raid_recover_end_io() will only mark failed sectors
  uptodate if it's doing an RMW.

  But this can trigger new ASSERT()s, as for recovery case, a recovered
  failed sector will not be marked uptodate, and trigger ASSERT() in
  later cache_rbio_pages() call.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Qu Wenruo
797d74b749 btrfs: raid56: allocate memory separately for rbio pointers
Currently inside alloc_rbio(), we allocate a larger memory to contain
the following members:

- struct btrfs_raid_rbio itself
- stripe_pages array
- bio_sectors array
- stripe_sectors array
- finish_pointers array

Then update rbio pointers to point the extra space after the rbio
structure itself.

Thus it introduced a complex CONSUME_ALLOC() macro to help the thing.

This is too hacky, and is going to make later pointers expansion harder.

This patch will change it to use regular kcalloc() for each pointer
inside btrfs_raid_bio, making the later expansion much easier.

And introduce a helper free_raid_bio_pointers() to free up all the
pointer members in btrfs_raid_bio, which will be used in both
free_raid_bio() and error path of alloc_rbio().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Qu Wenruo
ff2b64a22a btrfs: raid56: cleanup for function __free_raid_bio()
The cleanup involves two things:

- Remove the "__" prefix
  There is no naming confliction.

- Remove the forward declaration
  There is no special function call involved.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Josef Bacik
765c3fe99b btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCY
Inside of FB, as well as some user reports, we've had a consistent
problem of occasional ENOSPC transaction aborts.  Inside FB we were
seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really
low occurrence rate given the size of our fleet, but it's not nothing.

There are two causes of this particular problem.

First is delayed allocation.  The reservation system for delalloc
assumes that contiguous dirty ranges will result in 1 file extent item.
However if there is memory pressure that results in fragmented writeout,
or there is fragmentation in the block groups, this won't necessarily be
true.  Consider the case where we do a single 256MiB write to a file and
then close it.  We will have 1 reservation for the inode update, the
reservations for the checksum updates, and 1 reservation for the file
extent item.  At some point later we decide to write this entire range
out, but we're so fragmented that we break this into 100 different file
extents.  Since we've already closed the file and are no longer writing
to it there's nothing to trigger a refill of the delalloc block rsv to
satisfy the 99 new file extent reservations we need.  At this point we
exhaust our delalloc reservation, and we begin to steal from the global
reserve.  If you have enough of these cases going in parallel you can
easily exhaust the global reserve, get an ENOSPC at
btrfs_alloc_tree_block() time, and then abort the transaction.

The other case is the delayed refs reserve.  The delayed refs reserve
updates its size based on outstanding delayed refs and dirty block
groups.  However we only refill this block reserve when returning
excess reservations and when we call btrfs_start_transaction(root, X).
We will reserve 2*X credits at transaction start time, and fill in X
into the delayed refs reserve to make sure it stays topped off.
Generally this works well, but clearly has downsides.  If we do a
particularly delayed ref heavy operation we may never catch up in our
reservations.  Additionally running delayed refs generates more delayed
refs, and at that point we may be committing the transaction and have no
way to trigger a refill of our delayed refs rsv.  Then a similar thing
occurs with the delalloc reserve.

Generally speaking we well over-reserve in all of our block rsvs.  If we
reserve 1 credit we're usually reserving around 264k of space, but we'll
often not use any of that reservation, or use a few blocks of that
reservation.  We can be reasonably sure that as long as you were able to
reserve space up front for your operation you'll be able to find space
on disk for that reservation.

So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY.  This
gets used in the case that we've exhausted our reserve and the global
reserve.  It simply forces a reservation if we have enough actual space
on disk to make the reservation, which is almost always the case.  This
keeps us from hitting ENOSPC aborts in these odd occurrences where we've
not kept up with the delayed work.

Fixing this in a complete way is going to be relatively complicated and
time consuming.  This patch is what I discussed with Filipe earlier this
year, and what I put into our kernels inside FB.  With this patch we're
down to 1-2 ENOSPC aborts per week, which is a significant reduction.
This is a decent stop gap until we can work out a more wholistic
solution to these two corner cases.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:38 +01:00
Josef Bacik
7a66eda351 btrfs: move the btrfs_verity_descriptor_item defs up in ctree.h
These are wrapped in CONFIG_FS_VERITY, but we can have the definitions
without verity enabled.  Move these definitions up with the other
accessor helpers.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
890d2b1aa3 btrfs: move btrfs_next_old_item into ctree.c
This uses btrfs_header_nritems, which I will be moving out of ctree.h.
In order to avoid needing to include the relevant header in ctree.h,
simply move this helper function into ctree.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename parameters ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
eda517fd0c btrfs: move free space cachep's out of ctree.h
This is local to the free-space-cache.c code, remove it from ctree.h and
inode.c, create new init/exit functions for the cachep, and move it
locally to free-space-cache.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
226463d7b1 btrfs: move btrfs_path_cachep out of ctree.h
This is local to the ctree code, remove it from ctree.h and inode.c,
create new init/exit functions for the cachep, and move it locally to
ctree.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
956504a331 btrfs: move trans_handle_cachep out of ctree.h
This is local to the transaction code, remove it from ctree.h and
inode.c, create new helpers in the transaction to handle the init work
and move the cachep locally to transaction.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
f60acad355 btrfs: move btrfs_print_data_csum_error into inode.c
This isn't used outside of inode.c, there's no reason to define it in
btrfs_inode.h. Drop the inline and add __cold as it's for errors that
are not in any hot path.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
f1e5c6185c btrfs: move flush related definitions to space-info.h
This code is used in space-info.c, move the definitions to space-info.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
06d61cb101 btrfs: move btrfs_should_fragment_free_space into block-group.c
This function uses functions that are not defined in block-group.h, move
it into block-group.c in order to keep the header clean.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
390d89ccf6 btrfs: move discard stat defs to free-space-cache.h
These definitions are used for discard statistics, move them out of
ctree.h and put them in free-space-cache.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
ed4c491a3d btrfs: move BTRFS_MAX_MIRRORS into scrub.c
This is only used locally in scrub.c, move it out of ctree.h into
scrub.c.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
ad4b63caf5 btrfs: move maximum limits to btrfs_tree.h
We have maximum link and name length limits, move these to btrfs_tree.h
as they're on disk limitations.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:37 +01:00
Josef Bacik
51129b33d3 btrfs: move btrfs_get_block_group helper out of disk-io.h
This inline helper calls btrfs_fs_compat_ro(), which is defined in
another header.  To avoid weird header dependency problems move this
helper into disk-io.c with the rest of the global root helpers.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Josef Bacik
4300c58f80 btrfs: move btrfs on-disk definitions out of ctree.h
The bulk of our on-disk definitions exist in btrfs_tree.h, which user
space can use.  Keep things consistent and move the rest of the on disk
definitions out of ctree.h into btrfs_tree.h.  Note I did have to update
all u8's to __u8, but otherwise this is a strict copy and paste.

Most of the definitions are mainly for internal use and are not
guaranteed stable public API and may change as we need. Compilation
failures by user applications can happen.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Josef Bacik
4ce76e8e78 btrfs: remove unused BTRFS_IOPRIO_READA
The last user of this definition was removed in patch f26c923860
("btrfs: remove reada infrastructure") so we can remove this definition.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Josef Bacik
ea206640a6 btrfs: remove unused BTRFS_TOTAL_BYTES_PINNED_BATCH
This hasn't been used since 138a12d865 ("btrfs: rip out
btrfs_space_info::total_bytes_pinned") so it is safe to remove.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Josef Bacik
d60d956eb4 btrfs: remove unused set/clear_pending_info helpers
The last users of these helpers were removed in 5297199a8b ("btrfs:
remove inode number cache feature") so delete these helpers.

The point was for mount options that were applicable after transaction
commit so they could not be applied immediately. We don't have such
options anymore and if we do the patch can be reverted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Peng Hao
c1b078545e btrfs: simplify cleanup after error in btrfs_create_tree
Since leaf is already NULL, and no other branch will go to fail_unlock,
the fail_unlock label is useless and can be removed

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Josef Bacik
e5e886bad9 btrfs: add cached_state to read_extent_buffer_subpage
We don't use a cached state here at all, which generally makes sense as
async reads are going to unlock at endio time.  However for blocking
reads we will call wait_extent_bit() for our range.  Since the
lock_extent() stuff will return the cached_state for the start of the
range this is a helpful optimization to have for this case, we'll have
the exact state we want to wait on.  Add a cached state here and simply
throw it away if we're a non-blocking read, otherwise we'll get a small
improvement by eliminating some tree searches.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Josef Bacik
123a7f008c btrfs: cache the failed state when locking extents
Currently if we fail to lock a range we'll return the start of the range
that we failed to lock.  We'll then search down to this range and wait
on any extent states in this range.

However we can avoid this search altogether if we simply cache the
extent_state that had the contention.  We can pass this into
wait_extent_bit() and start from that extent_state without doing the
search.  In the most optimistic case we can avoid all searches, more
likely we'll avoid the initial search and have to perform the search
after we wait on the failed state, or worst case we must search both
times which is what currently happens.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Josef Bacik
9c5c960463 btrfs: use a cached_state everywhere in relocation
All of the relocation code avoids using the cached state, despite
everywhere using the normal

  lock_extent()
  // do something
  unlock_extent()

pattern.  Fix this by plumbing a cached state throughout all of these
functions in order to allow for less tree searches.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Josef Bacik
632ddfa213 btrfs: use cached_state for btrfs_check_nocow_lock
Now that try_lock_extent() takes a cached_state, plumb the cached_state
through btrfs_try_lock_ordered_range() and then use a cached_state in
btrfs_check_nocow_lock everywhere to avoid extra tree searches on the
extent_io_tree.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:36 +01:00
Josef Bacik
83ae4133ac btrfs: add a cached_state to try_lock_extent
With nowait becoming more pervasive throughout our codebase go ahead and
add a cached_state to try_lock_extent().  This allows us to be faster
about clearing the locked area if we have contention, and then gives us
the same optimization for unlock if we are able to lock the range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:35 +01:00
Marc Zyngier
382b5b87a9 Merge branch kvm-arm64/mte-map-shared into kvmarm-master/next
* kvm-arm64/mte-map-shared:
  : .
  : Update the MTE support to allow the VMM to use shared mappings
  : to back the memslots exposed to MTE-enabled guests.
  :
  : Patches courtesy of Catalin Marinas and Peter Collingbourne.
  : .
  : Fix a number of issues with MTE, such as races on the tags
  : being initialised vs the PG_mte_tagged flag as well as the
  : lack of support for VM_SHARED when KVM is involved.
  :
  : Patches from Catalin Marinas and Peter Collingbourne.
  : .
  Documentation: document the ABI changes for KVM_CAP_ARM_MTE
  KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE enabled
  KVM: arm64: unify the tests for VMAs in memslots when MTE is enabled
  arm64: mte: Lock a page for MTE tag initialisation
  mm: Add PG_arch_3 page flag
  KVM: arm64: Simplify the sanitise_mte_tags() logic
  arm64: mte: Fix/clarify the PG_mte_tagged semantics
  mm: Do not enable PG_arch_2 for all 64-bit architectures

Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-12-05 14:38:24 +00:00
Bartosz Taudul
7868f93006 udf: Increase UDF_MAX_READ_VERSION to 0x0260
Some discs containing the UDF file system are unable to be mounted,
failing with the following message:

  UDF-fs: error (device sr0): udf_fill_super: minUDFReadRev=260
    (max is 250)

The UDF 2.60 specification [0] states in the section Basic Restrictions
& Requirements (page 10):

  The Minimum UDF Read Revision value shall be at most #0250 for all
  media with a UDF 2.60 file system. This indicates that a UDF 2.50
  implementation can read all UDF 2.60 media. Media that do not have a
  Metadata Partition may use a value lower than #250.

The conclusion is that the discs failing to mount were burned with a
faulty software, which didn't follow the specification. This can be
worked around by increasing UDF_MAX_READ_VERSION to 0x260, to match the
Minimum Read Revision. No other changes are required, as reading UDF
2.60 is backward compatible with UDF 2.50.

[0] http://www.osta.org/specs/pdf/udf260.pdf

Signed-off-by: Bartosz Taudul <wolf@nereid.pl>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-12-05 11:50:44 +01:00
Yang Li
1f5619ed88 xfs: Remove duplicated include in xfs_iomap.c
./fs/xfs/xfs_iomap.c: xfs_error.h is included more than once.
./fs/xfs/xfs_iomap.c: xfs_errortag.h is included more than once.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3337
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-12-04 09:42:59 -08:00
Wang Yufen
e1fce56490 pstore/ram: Fix error return code in ramoops_probe()
In the if (dev_of_node(dev) && !pdata) path, the "err" may be assigned a
value of 0, so the error return code -EINVAL may be incorrectly set
to 0. To fix set valid return code before calling to goto.

Fixes: 35da60941e ("pstore/ram: add Device Tree bindings")
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/1669969374-46582-1-git-send-email-wangyufen@huawei.com
2022-12-02 15:16:59 -08:00
Liu Shixin
6a46bf5588 binfmt_misc: fix shift-out-of-bounds in check_special_flags
UBSAN reported a shift-out-of-bounds warning:

 left shift of 1 by 31 places cannot be represented in type 'int'
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
  ubsan_epilogue+0xa/0x44 lib/ubsan.c:151
  __ubsan_handle_shift_out_of_bounds+0x1e7/0x208 lib/ubsan.c:322
  check_special_flags fs/binfmt_misc.c:241 [inline]
  create_entry fs/binfmt_misc.c:456 [inline]
  bm_register_write+0x9d3/0xa20 fs/binfmt_misc.c:654
  vfs_write+0x11e/0x580 fs/read_write.c:582
  ksys_write+0xcf/0x120 fs/read_write.c:637
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x34/0x80 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x4194e1

Since the type of Node's flags is unsigned long, we should define these
macros with same type too.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221102025123.1117184-1-liushixin2@huawei.com
2022-12-02 13:57:04 -08:00
Linus Torvalds
bdaa78c6aa Merge tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
 "15 hotfixes,  11 marked cc:stable.

  Only three or four of the latter address post-6.0 issues, which is
  hopefully a sign that things are converging"

* tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible"
  Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled
  drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame
  mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths
  mm/khugepaged: fix GUP-fast interaction by sending IPI
  mm/khugepaged: take the right locks for page table retraction
  mm: migrate: fix THP's mapcount on isolation
  mm: introduce arch_has_hw_nonleaf_pmd_young()
  mm: add dummy pmd_young() for architectures not having it
  mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes()
  tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep"
  nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()
  hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
  madvise: use zap_page_range_single for madvise dontneed
  mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
2022-12-02 13:39:38 -08:00
Gustavo A. R. Silva
d272e01fa0 ksmbd: replace one-element arrays with flexible-array members
One-element arrays are deprecated, and we are replacing them with flexible
array members instead. So, replace one-element arrays with flexible-array
members in multiple structs in fs/ksmbd/smb_common.h and one in
fs/ksmbd/smb2pdu.h.

Important to mention is that doing a build before/after this patch results
in no binary output differences.

This helps with the ongoing efforts to tighten the FORTIFY_SOURCE routines
on memcpy() and help us make progress towards globally enabling
-fstrict-flex-arrays=3 [1].

Link: https://github.com/KSPP/linux/issues/242
Link: https://github.com/KSPP/linux/issues/79
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [1]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/Y3OxronfaPYv9qGP@work
2022-12-02 13:14:29 -08:00
Yang Yingliang
11fa7fefe3 chardev: fix error handling in cdev_device_add()
While doing fault injection test, I got the following report:

------------[ cut here ]------------
kobject: '(null)' (0000000039956980): is not initialized, yet kobject_put() is being called.
WARNING: CPU: 3 PID: 6306 at kobject_put+0x23d/0x4e0
CPU: 3 PID: 6306 Comm: 283 Tainted: G        W          6.1.0-rc2-00005-g307c1086d7c9 #1253
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:kobject_put+0x23d/0x4e0
Call Trace:
 <TASK>
 cdev_device_add+0x15e/0x1b0
 __iio_device_register+0x13b4/0x1af0 [industrialio]
 __devm_iio_device_register+0x22/0x90 [industrialio]
 max517_probe+0x3d8/0x6b4 [max517]
 i2c_device_probe+0xa81/0xc00

When device_add() is injected fault and returns error, if dev->devt is not set,
cdev_add() is not called, cdev_del() is not needed. Fix this by checking dev->devt
in error path.

Fixes: 233ed09d7f ("chardev: add helper function to register char devs with a struct device")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20221202030237.520280-1-yangyingliang@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02 17:48:59 +01:00
Christophe JAILLET
6e0149a553 9p/fs: Remove unneeded idr.h #include
The 9p fs does not use IDR or IDA functionalities. So there is no point in
including <linux/idr.h>.
Remove it.

Link: https://lkml.kernel.org/r/3d1e0ed9714eaee7e18d9f5b0b4bfa49b00b286d.1669553950.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
[Dominique: reword subject]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-12-02 23:59:15 +09:00
Andreas Gruenbacher
38552ff676 gfs2: Fix and clean up create / evict interaction
When gfs2_create_inode() fails after creating a new inode, it uses the
GIF_FREE_VFS_INODE and GIF_ALLOC_FAILED inode flags to communicate to
gfs2_evict_inode() which parts of the inode need to be deallocated and
destroyed.  In some error cases, the inode ends up being allocated on
disk and then accidentally left behind.  In others, the inode is
partially constructed and then not properly destroyed.  Clean this up by
completely handling the inode deallocation and destruction in
gfs2_evict_inode().

This means that gfs2_evict_inode() may now be faced with partially
constructed inodes, so add the necessary checks to cope with that.  In
particular, make sure that for incompletely constructed inodes, we're
not accessing the buffers backing the on-disk blocks; the contents may
be undefined.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-02 15:58:00 +01:00
Andreas Gruenbacher
3d0258bc11 gfs2: Clean up initialization of "ip" in gfs2_create_inode
Initialize variable "ip" earlier so that it can be used interchangeably
with "inode" everywhere.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-12-02 15:58:00 +01:00